dataengineering Archives - ProdSens.live

Different file formats, a benchmark doing basic operations

Recently, I’ve been designing a data lake to store different types of data from various sources, catering to diverse demands across different areas and levels. To determine the best file type for storing this data, I compiled points of interest, considering the needs and demands of different areas. These points include:

Tool Compatibility

Tool compatibility refers to which tools can write and read a specific file type. No/low code tools are crucial, especially when tools like Excel/LibreOffice play a significant role in operational layers where collaborators may have less technical knowledge to use other tools.

Storage

How much extra or less space will a particular file type cost in the data lake? While non-volatile memory is relatively cheap nowadays, both on-premise and in the cloud, with a large volume of data, any savings and storage optimization can make a difference in the final balance.

Reading

How long do the tools that will consume the data take to open and read the file? In applications where reading seconds matter, sacrificing compatibility and storage for gains in processing time becomes crucial in the data pipeline architecture planning.

Writing

How long will the tools used by our data team take to generate the file in the data lake? If immediate file availability is a priority, this is an attribute we would like to minimize as much as possible.

Query

Some services will directly consume data from the file and perform grouping and filtering functions. Therefore, it’s essential to consider how much time these operations will take to make the correct choice in our data solution.

Benchmark

Files

1 – IBM Transactions for Anti Money Laundering (AML)

Rows: 31 million
Columns: 11
Types: Timestamp, String, Integer, Digital, Boolean

2 – Malware Detection in Network Traffic Data

Rows: 6 million
Columns: 23
Types: String, Integer

Number of Tests

15 tests were conducted for each operation on each file, and the results in the graphs represent the average of each test iteration’s results. The only variable unaffected by the number of tests is the file size, which remains the same regardless of how many times it is written.

Why 2 datasets?

I chose two completely different datasets. The first is significantly larger than the second, has few columns with little data variability, and contains more complex types such as timestamps. The second, in contrast, contains many null values represented by “-” and many columns with duplicate values where the distinction between values is low. These characteristics highlight the distinctions, strengths, and weaknesses of each format.

Script

The script used for benchmarking is open on GitHub for anyone who wants to check or conduct their benchmarks with their files, which I strongly recommend.

file-format-benchmark: benchmark script of key operations between different file formats

Tools

I will use Python with Spark for the benchmark. Spark allows native queries on different file types, unlike Pandas, which requires an extra library to achieve this. Additionally, Spark is more performant on larger datasets, and the datasets used in this benchmark are large enough that Pandas struggled with them.

Env:
Python version: 3.11.7
Spark version: 3.5.0
Hadoop version: 3.4.1

Benchmark Results

Tool Compatibility

Although I wanted to measure tool compatibility, I couldn’t find a good way to quantify it, so I’ll share my opinion. For pipelines with downstream stakeholders who have more technical knowledge (data scientists, machine learning engineers, etc.), the file format matters little: with a library or framework in any programming language, you can manipulate information from a file in any format. However, for non-technical stakeholders like business analysts, C-level executives, or other collaborators who work directly with product/service production, the scenario changes. These individuals often use tools like Excel, LibreOffice, Power BI, or Tableau (which, despite having more native readers, do not support Avro or ORC).

In cases where files are consumed “manually” by people, you will almost always opt for CSV or JSON. These formats, being plain text, can be opened, read, and understood in any text editor, and practically every tool can read structured data in them. Parquet still has reasonable compatibility, being the columnar format with the most support and attention from the community. On the other hand, ORC and Avro have very little support, and it can be challenging to find parsers and serializers for them in non-Apache tools.

In summary, CSV and JSON have a significant advantage over the others, and you will likely choose them when your stakeholders are directly handling the files and lack technical knowledge.

Storage

Dataset 1:
Storage results graph

Dataset 2:
Storage results graph
To calculate storage, we loaded the dataset in CSV format, rewrote it in all formats (including CSV itself), and listed the amount of space they occupy.
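
The full benchmark script is on GitHub (linked above); the following is only a minimal sketch of the idea in PySpark, with illustrative paths: read the CSV once, rewrite it in each format, and sum the size of every file Spark produces for that format. Note that writing Avro requires the external spark-avro package on the classpath.

### sketch (not the benchmark script): measuring storage per format
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-benchmark").getOrCreate()
df = spark.read.csv("data/dataset1.csv", header=True, inferSchema=True)

def dir_size_mb(path: str) -> float:
    # Spark writes a directory of part files, so sum every file inside it
    total = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )
    return total / 1024 ** 2

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    out = f"output/dataset1_{fmt}"
    df.write.format(fmt).mode("overwrite").save(out)
    print(fmt, round(dir_size_mb(out), 1), "MB")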

The graphs show a significant disadvantage for JSON, which was three times larger than the second-largest file (CSV) in both cases. The difference is so pronounced due to the way JSON is written: a list of objects with key-value pairs, where the key is the column name and the value is that column’s value in the tuple. This results in unnecessary schema redundancy, repeating the column name in every record. Since both are plain text without any compression, CSV and JSON show the two worst performances in terms of storage. Parquet, ORC, and Avro had very similar results, highlighting their storage efficiency compared to the more common types. The key reasons for this advantage are that Parquet, ORC, and Avro are binary formats, and Parquet and ORC are also columnar formats that significantly reduce data redundancy, avoiding waste and optimizing space. All three formats have highly efficient compression methods.

In summary, CSV and JSON are by no means the best choices for storage optimization, especially in cases like storing logs or data that has no immediate importance but cannot be discarded.

Reading

Dataset 1:
Reading results graph

Dataset 2:
Reading results graph
In the reading operation, we timed the dataset loading and printed the first 5 rows.
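
As a rough illustration (not the exact benchmark code), the timing for this operation can be sketched like this in PySpark; the paths are illustrative, and CSV would additionally need header/schema options:

### sketch (not the benchmark script): timing the read of each format
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-benchmark").getOrCreate()

def time_read(fmt: str, path: str) -> float:
    start = time.perf_counter()
    df = spark.read.format(fmt).load(path)
    df.show(5)  # forces Spark to actually read data, not just plan the job
    return time.perf_counter() - start

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    seconds = time_read(fmt, f"output/dataset1_{fmt}")
    print(f"{fmt}: {seconds:.2f}s")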

In reading, there is a peculiar case: despite the large differences in file size (up to 3x), the only format with a visible and relevant difference was JSON. This occurs solely due to the way JSON is written, which makes it costly for the Spark parser to work with that amount of redundant schema metadata, and reading time grows quickly as the file grows. As for why CSV performed as well as ORC and Parquet: CSV is extremely simple, lacking metadata like a schema with types or field names, so it is quick for the Spark parser to read, split, and infer the column types of a CSV file. ORC and, especially, Parquet carry a large amount of metadata that is useful for files with more fields, complex types, and larger volumes of data. The difference between Avro, Parquet, and ORC is minimal and varies depending on the state of the cluster/machine, simultaneous tasks, and the data file layout. For these datasets, the reading differences are hard to evaluate; they become more evident when scaling these files to several times the size of the datasets we are working with.

In summary, CSV, Parquet, ORC, and Avro had almost no difference in reading performance, while JSON cannot be considered as an option in cases where fast data reading is required. Few cases prioritize reading alone; it is generally evaluated along with another task like a query. If you are looking for the most performant file type for this operation, you should consider conducting your own tests.

Writing

Dataset 1:
Write results graph

Dataset 2:
Write results graph
In the writing operation, we read a .csv file and rewrote it in the respective format, only counting the writing time.
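
A minimal sketch of that measurement, again with illustrative paths: the source CSV is read and cached up front so that only the rewrite into each format is timed.

### sketch (not the benchmark script): timing the write of each format
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-benchmark").getOrCreate()
df = spark.read.csv("data/dataset1.csv", header=True, inferSchema=True).cache()
df.count()  # materialize the cache so the initial read is not counted in the write time

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    start = time.perf_counter()
    df.write.format(fmt).mode("overwrite").save(f"output/write_test_{fmt}")
    print(f"{fmt}: {time.perf_counter() - start:.2f}s")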

In writing, there was a surprise: JSON was not the slowest format in the first dataset; ORC was. In the second dataset, however, JSON took the longest. This discrepancy is due to the second dataset having more columns, meaning more metadata to be written. ORC is a binary format with static typing, similar to Parquet, but it applies “better” optimization and compression techniques, which require more processing power and time. This explains its query times (which we will see next) and its file sizes, which are almost always smaller than the equivalent Parquet files. CSV performed well because it is a very simple format, lacking additional metadata such as a schema with types, or redundant metadata like JSON’s. At a larger scale, the more complex formats would outperform CSV. Avro also has its benefits and had a very positive result in dataset 1, outperforming Parquet and ORC by a significant margin. This probably happened because the data layout favors Avro’s optimizations, which differ from those of Parquet and ORC.

In summary, Avro, despite not being a format with much fame or community support, is a good choice in situations where you want the quick availability of your files for stakeholders to consume. It starts making a difference when scaling to several GBs of data, where the difference becomes 20-30 minutes instead of 30-40 seconds.

Query

Dataset 1:
Query results graph

Dataset 2:
Query results graph
In the query operation, the dataset was loaded, and a query with only one WHERE clause filtering a unique value was performed, followed by printing the respective tuple.
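
Sketched in PySpark (the column name and filter value below are made up; the actual query is in the benchmark script linked above):

### sketch (not the benchmark script): timing a single-filter query per format
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-benchmark").getOrCreate()

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    df = spark.read.format(fmt).load(f"output/dataset1_{fmt}")
    start = time.perf_counter()
    df.filter(df["transaction_id"] == "some-unique-value").show()  # hypothetical column and value
    print(f"{fmt}: {time.perf_counter() - start:.2f}s")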

In the first dataset, all formats had good performance, and the graph scales give the impression that Parquet performed poorly; in reality, the differences are minimal. Since dataset 2 is much smaller, we believe its query results are very susceptible to external factors, so we will focus on explaining the results for the first dataset.

As mentioned earlier, ORC performs well, comparable even to Avro, which had excellent performance in other operations. Still, Parquet leads this ranking with the fastest query result. Why? Parquet being the default format for Spark says a lot about how the framework works with this format. It incorporates various query optimization techniques, many of them consolidated in DBMSs. One of the most famous is predicate pushdown, which pushes WHERE-clause filters down to the scan so that less data is read and examined on disk. This is an optimization not present in ORC.

Why do CSV and JSON lag so far behind? In this case, CSV and JSON are not the problem; the truth is that Parquet and ORC are very well optimized. All the benefits mentioned earlier, such as schema metadata, binary encoding, and columnar layout, give them a significant advantage. And where does Avro fit into this, since it shares many of these benefits? In terms of query optimization, Avro lags far behind ORC and Parquet. One point worth mentioning is column projection, which reads only the specific columns used in the query rather than the entire dataset; this is present in ORC and Parquet but not in Avro. Logically, this is not the only thing that separates ORC and Parquet from Avro, but overall, Avro falls far behind in query optimization.

In summary, when working with large files, with both simple and complex queries, you will want to work with Parquet or ORC. Both have many query optimizations that will deliver results much faster compared to other formats. This difference is already evident in files slightly smaller than dataset 1 and becomes even more apparent in larger files.

Conclusion

In a data engineering environment where you need to serve various stakeholders, consume from various sources, and store data in different storage systems, operations such as reading, writing, and querying are strongly affected by the file format. Here, we saw the issues certain formats may have, evaluating the main points that come up when building data environments.

Parquet may be the “darling” of the community, and we highlighted some of its strengths, such as query performance, but we also showed that there are better options for certain scenarios, such as ORC for storage optimization.

The performance of these operations for each format also depends heavily on the tool you are using and how you are using it (environment and available resources). The results from Spark probably will not differ much from those of other robust frameworks like DuckDB or Flink, but we recommend that you conduct your own tests before making any decision that will have a significant impact on other areas of the business.

Introduction to Data Science

Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It involves the use of statistics, data analysis, machine learning, and related methods to understand and analyze actual phenomena with data.

The Evolution of Data Science

Data Science has evolved from statistics and data analysis over the years. With the advent of computers and an increase in data generation, the need for data processing and analysis grew. This evolution led to the development of more sophisticated data analysis methods and the emergence of machine learning and artificial intelligence as key components of data science.

Key Components of Data Science

Data Mining
Data mining involves exploring and analyzing large blocks of information to glean meaningful patterns and trends. It can involve the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.

Machine Learning
Machine Learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves.

Big Data
Big Data refers to data that is so large, fast, or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time, but the concept of big data gained momentum in the early 2000s.

Statistical Analysis
This refers to the collection, analysis, interpretation, presentation, and organization of data. Statistical analysis can be used in a wide range of fields, including social sciences, business, and engineering.

How Does Data Science Work?

Ask a Question: It all starts with a curious question about something we want to know.
Gather Information: Just like collecting clues, we gather all the data (information) we need.
Clean the Mess: We tidy up our data, sorting it out so it’s easier to use.
Start Detective Work: This is where we explore our data to find interesting trends or patterns.
Create a Data Model: Think of this like a mini-experiment to test our guesses about the data.
Check the Results: We see if our mini-experiment worked well.
Use Your Findings: Finally, we use what we learned to make decisions or solve problems.

Applications of Data Science

Data science has a wide range of applications including business intelligence, health care, finance, forecasting, image and speech recognition, and many more. It is used to predict customer behavior, enhance business operations, forecast trends, and make informed decisions.

Conclusion

The field of Data Science is continuously evolving as technology advances. It is becoming increasingly important in various sectors for making more informed and accurate decisions. As data continues to grow in volume and complexity, the role of data scientists is becoming more pivotal in interpreting the data for successful business outcomes.

Big data models 📊 vs. Computer memory 💾

Data pipelines are the backbone of any data-intensive project. As datasets grow beyond memory size (“out-of-core”), handling them efficiently becomes challenging.
Dask enables effortless management of large datasets (out-of-core), offering great compatibility with Numpy and Pandas.

Pipelines

This article focuses on the seamless integration of Dask (for handling out-of-core data) with Taipy, a Python library used for pipeline orchestration and scenario management.

Taipy – Your web application builder

A little bit about us: Taipy is an open-source library designed for easy development of both front ends (GUI) and ML/data pipelines.
No other knowledge is required (no CSS, no nothing!).
It has been designed to expedite application development, from initial prototypes to production-ready applications.


Star ⭐ the Taipy repository

We’re almost at 1000 stars and couldn’t do this without you🙏

1. Sample Application

Integrating Dask and Taipy is demonstrated best with an example. In this article, we’ll consider a data workflow with 4 tasks:

  • Data Preprocessing and Customer Scoring
    Read and process a large dataset using Dask, then score customers based on purchase behavior.

  • Feature Engineering and Segmentation
    Engineer additional features and segment customers into different categories based on these scores and other factors.

  • Segment Analysis
    Analyze each customer segment to derive group-wise insights.

  • Summary Statistics for High-Value Customers
    Compute summary statistics for the high-value customer segment.

We will explore the code of these 4 tasks in finer detail.
Note that this code is your Python code and is not using Taipy.
In a later section, we will show how you can use Taipy to model your existing data applications, and reap the benefits of its workflow orchestration with little effort.

The application will comprise the following 5 files:

algos/
├─ algo.py  #  Our existing code with 4 tasks
data/
├─ SMALL_amazon_customers_data.csv  #  A sample dataset
app.ipynb  # Jupyter Notebook for running our sample data application
config.py  # Taipy configuration which models our data workflow
config.toml  # (Optional) Taipy configuration in TOML made using Taipy Studio

2. Introducing Taipy – A Comprehensive Solution

Taipy is more than just another orchestration tool.
Especially designed for ML engineers, data scientists, and Python developers, Taipy brings several essential and simple features.
Here are some key elements that make Taipy a compelling choice:

  1. Pipeline execution registry
    This feature enables developers and end-users to:

    • Register each pipeline execution as a “Scenario” (a graph of tasks and data nodes);
    • Precisely trace the lineage of each pipeline execution; and
    • Compare scenarios with ease, monitor KPIs and provide invaluable insight for troubleshooting and fine-tuning parameters.
  2. Pipeline versioning
    Taipy’s robust scenario management enables you to adapt your pipelines to evolving project demands effortlessly.

  3. Smart task orchestration
    Taipy allows the developer to model the network of tasks and data sources easily.
    This feature provides a built-in control over the execution of your tasks with:

    • Parallel execution of your tasks; and
    • Task “skipping”, i.e., choosing which tasks to execute and
      which to bypass.
  4. Modular approach to task orchestration
    Modularity isn’t just a buzzword with Taipy; it’s a core principle.
    You can set up tasks and data sources that can be used interchangeably, resulting in a cleaner, more maintainable codebase.

3. Introducing Dask

Dask is a popular Python package for distributed computing. The Dask API implements the familiar Pandas, Numpy and Scikit-learn APIs, which makes learning and using Dask much more pleasant for the many data scientists who are already familiar with these APIs.
If you’re new to Dask, check out the excellent 10-minute Introduction to Dask by the Dask team.
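
To give a flavour of the API, here is a tiny, hypothetical example; the file pattern and the "Country" column are made up, while "TotalPurchaseAmount" matches the dataset used later in this article.

### a minimal Dask illustration (hypothetical file pattern and columns)
import dask.dataframe as dd

# Lazily point at one or many CSV files; nothing is loaded into memory yet
df = dd.read_csv("data/customers-*.csv")

# Pandas-style filtering and aggregation, still lazy
high_spenders = df[df["TotalPurchaseAmount"] > 1000]
avg_by_country = high_spenders.groupby("Country")["TotalPurchaseAmount"].mean()

# .compute() runs the chunked, parallel computation and returns a Pandas object
print(avg_by_country.compute())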

4. Application: Customer Analysis (algos/algo.py)

DAG schema
A graph of our 4 tasks (visualized in Taipy) which we will model in the next section.

Our existing code (without Taipy) comprises 4 functions, which you can also see in the graph above:

  • Task 1: preprocess_and_score
  • Task 2: featurization_and_segmentation
  • Task 3: segment_analysis
  • Task 4: high_value_cust_summary_statistics

You can skim through the following algos/algo.py script which defines the 4 functions and then continue reading on for a brief description of what each function does:

### algos/algo.py
import time

import dask.dataframe as dd
import pandas as pd

def preprocess_and_score(path_to_original_data: str):
    print("__________________________________________________________")
    print("1. TASK 1: DATA PREPROCESSING AND CUSTOMER SCORING ...")
    start_time = time.perf_counter()  # Start the timer

    # Step 1: Read data using Dask
    df = dd.read_csv(path_to_original_data)

    # Step 2: Simplify the customer scoring formula
    df["CUSTOMER_SCORE"] = (
        0.5 * df["TotalPurchaseAmount"] / 1000 + 0.3 * df["NumberOfPurchases"] / 10 + 0.2 * df["AverageReviewScore"]
    )

    # Step 3: Keep only the scoring-related columns
    scored_df = df[["CUSTOMER_SCORE", "TotalPurchaseAmount", "NumberOfPurchases", "TotalPurchaseTime"]]

    # Step 4: Materialize the Dask DataFrame into a Pandas DataFrame
    pd_df = scored_df.compute()

    end_time = time.perf_counter()  # Stop the timer
    execution_time = (end_time - start_time) * 1000  # Calculate the time in milliseconds
    print(f"Time of Execution: {execution_time:.4f} ms")

    return pd_df

def featurization_and_segmentation(scored_df, payment_threshold, score_threshold):
    print("__________________________________________________________")
    print("2. TASK 2: FEATURE ENGINEERING AND SEGMENTATION ...")

    # payment_threshold, score_threshold = float(payment_threshold), float(score_threshold)
    start_time = time.perf_counter()  # Start the timer

    df = scored_df

    # Feature: Indicator if customer's total purchase is above the payment threshold
    df["HighSpender"] = (df["TotalPurchaseAmount"] > payment_threshold).astype(int)

    # Feature: Average time between purchases
    df["AverageTimeBetweenPurchases"] = df["TotalPurchaseTime"] / df["NumberOfPurchases"]

    # Additional computationally intensive features
    df["Interaction1"] = df["TotalPurchaseAmount"] * df["NumberOfPurchases"]
    df["Interaction2"] = df["TotalPurchaseTime"] * df["CUSTOMER_SCORE"]
    df["PolynomialFeature"] = df["TotalPurchaseAmount"] ** 2

    # Segment customers based on the score_threshold
    df["ValueSegment"] = ["High Value" if score > score_threshold else "Low Value" for score in df["CUSTOMER_SCORE"]]

    end_time = time.perf_counter()  # Stop the timer
    execution_time = (end_time - start_time) * 1000  # Calculate the time in milliseconds
    print(f"Time of Execution: {execution_time:.4f} ms")

    return df

def segment_analysis(df: pd.DataFrame, metric):
    print("__________________________________________________________")
    print("3. TASK 3: SEGMENT ANALYSIS ...")
    start_time = time.perf_counter()  # Start the timer

    # Detailed analysis for each segment: mean/median of various metrics
    segment_analysis = (
        df.groupby("ValueSegment")
        .agg(
            {
                "CUSTOMER_SCORE": metric,
                "TotalPurchaseAmount": metric,
                "NumberOfPurchases": metric,
                "TotalPurchaseTime": metric,
                "HighSpender": "sum",  # Total number of high spenders in each segment
                "AverageTimeBetweenPurchases": metric,
            }
        )
        .reset_index()
    )

    end_time = time.perf_counter()  # Stop the timer
    execution_time = (end_time - start_time) * 1000  # Calculate the time in milliseconds
    print(f"Time of Execution: {execution_time:.4f} ms")

    return segment_analysis

def high_value_cust_summary_statistics(df: pd.DataFrame, segment_analysis: pd.DataFrame, summary_statistic_type: str):
    print("__________________________________________________________")
    print("4. TASK 4: ADDITIONAL ANALYSIS BASED ON SEGMENT ANALYSIS ...")
    start_time = time.perf_counter()  # Start the timer

    # Filter out the High Value customers
    high_value_customers = df[df["ValueSegment"] == "High Value"]

    # Use summary_statistic_type to calculate different types of summary statistics
    if summary_statistic_type == "mean":
        average_purchase_high_value = high_value_customers["TotalPurchaseAmount"].mean()
    elif summary_statistic_type == "median":
        average_purchase_high_value = high_value_customers["TotalPurchaseAmount"].median()
    elif summary_statistic_type == "max":
        average_purchase_high_value = high_value_customers["TotalPurchaseAmount"].max()
    elif summary_statistic_type == "min":
        average_purchase_high_value = high_value_customers["TotalPurchaseAmount"].min()

    median_score_high_value = high_value_customers["CUSTOMER_SCORE"].median()

    # Fetch the summary statistic for 'TotalPurchaseAmount' for High Value customers from segment_analysis
    segment_statistic_high_value = segment_analysis.loc[
        segment_analysis["ValueSegment"] == "High Value", "TotalPurchaseAmount"
    ].values[0]

    # Create a DataFrame to hold the results
    result_df = pd.DataFrame(
        {
            "SummaryStatisticType": [summary_statistic_type],
            "AveragePurchaseHighValue": [average_purchase_high_value],
            "MedianScoreHighValue": [median_score_high_value],
            "SegmentAnalysisHighValue": [segment_statistic_high_value],
        }
    )

    end_time = time.perf_counter()  # Stop the timer
    execution_time = (end_time - start_time) * 1000  # Calculate the time in milliseconds
    print(f"Time of Execution: {execution_time:.4f} ms")

    return result_df

Task 1 – Data Preprocessing and Customer Scoring

Python function: preprocess_and_score
This is the first step in your pipeline and perhaps the most crucial.
It reads a large dataset using Dask, designed for larger-than-memory computation.
It then calculates a “Customer Score” in a DataFrame named scored_df, based on various metrics like “TotalPurchaseAmount”, “NumberOfPurchases”, and “AverageReviewScore”.

After reading and processing the dataset with Dask, this task will output a Pandas DataFrame for further use in the remaining 3 tasks.

Task 2 – Feature Engineering and Segmentation

Python function: featurization_and_segmentation
This task takes the scored DataFrame and adds new features, such as an indicator for high spending.
It also segments the customers based on their scores.

Task 3 – Segment Analysis

Python function: segment_analysis
This task takes the segmented DataFrame and performs a group-wise analysis based on the customer segments to calculate various metrics.

Task 4 – Summary Statistics for High-Value Customers

Python function: high_value_cust_summary_statistics
This task performs an in-depth analysis of the high-value customer segment and returns summary statistics.

5. Modelling the Workflow in Taipy (config.py)

DAG in studio
Taipy DAG — Taipy “Tasks” in orange and “Data Nodes” in blue.

In this section, we will create the Taipy configuration which models the variables/parameters (represented as “Data Nodes”) and functions (represented as “Tasks”) in Taipy.

Notice that this configuration in the following config.py script is akin to defining variables and functions — except that we are instead defining “blueprint variables” (Data Nodes) and “blueprint functions” (Tasks).
We are telling Taipy how to call the functions we defined earlier, the default values of Data Nodes (which we may overwrite at runtime), and whether Tasks may be skipped:

### config.py
from taipy import Config

from algos.algo import (
    preprocess_and_score,
    featurization_and_segmentation,
    segment_analysis,
    high_value_cust_summary_statistics,
)

# -------------------- Data Nodes --------------------

path_to_data_cfg = Config.configure_data_node(id="path_to_data", default_data="data/customers_data.csv")

scored_df_cfg = Config.configure_data_node(id="scored_df")

payment_threshold_cfg = Config.configure_data_node(id="payment_threshold", default_data=1000)

score_threshold_cfg = Config.configure_data_node(id="score_threshold", default_data=1.5)

segmented_customer_df_cfg = Config.configure_data_node(id="segmented_customer_df")

metric_cfg = Config.configure_data_node(id="metric", default_data="mean")

segment_result_cfg = Config.configure_data_node(id="segment_result")

summary_statistic_type_cfg = Config.configure_data_node(id="summary_statistic_type", default_data="median")

high_value_summary_df_cfg = Config.configure_data_node(id="high_value_summary_df")

# -------------------- Tasks --------------------

preprocess_and_score_task_cfg = Config.configure_task(
    id="preprocess_and_score",
    function=preprocess_and_score,
    skippable=True,
    input=[path_to_data_cfg],
    output=[scored_df_cfg],
)

featurization_and_segmentation_task_cfg = Config.configure_task(
    id="featurization_and_segmentation",
    function=featurization_and_segmentation,
    skippable=True,
    input=[scored_df_cfg, payment_threshold_cfg, score_threshold_cfg],
    output=[segmented_customer_df_cfg],
)

segment_analysis_task_cfg = Config.configure_task(
    id="segment_analysis",
    function=segment_analysis,
    skippable=True,
    input=[segmented_customer_df_cfg, metric_cfg],
    output=[segment_result_cfg],
)

high_value_cust_summary_statistics_task_cfg = Config.configure_task(
    id="high_value_cust_summary_statistics",
    function=high_value_cust_summary_statistics,
    skippable=True,
    input=[segment_result_cfg, segmented_customer_df_cfg, summary_statistic_type_cfg],
    output=[high_value_summary_df_cfg],
)

scenario_cfg = Config.configure_scenario(
    id="scenario_1",
    task_configs=[
        preprocess_and_score_task_cfg,
        featurization_and_segmentation_task_cfg,
        segment_analysis_task_cfg,
        high_value_cust_summary_statistics_task_cfg,
    ],
)


You can read more about configuring Scenarios, Tasks and Data Nodes in the documentation here.

Taipy Studio

Taipy Studio is a VS Code extension from Taipy that allows you to build and visualize your pipelines with simple drag-and-drop interactions.
Taipy Studio provides a graphical editor where you can create your Taipy configurations stored in TOML files that your Taipy application can load to run.
The editor represents Scenarios as graphs, where nodes are Data Nodes and Tasks.

As an alternative for the config.py script in this section, you may instead use Taipy Studio to generate a config.toml configuration file.
The penultimate section in this article will provide a guide on how to create the config.toml configuration file using Taipy Studio.

6. Scenario Creation and Execution

Executing a Taipy scenario involves:

  • Loading the config;
  • Running the Taipy Core service; and
  • Creating and submitting the scenario for execution.

Here’s the basic code template:

import taipy as tp
from config import scenario_cfg  # Import the Scenario configuration
tp.Core().run()  # Start the Core service
scenario_1 = tp.create_scenario(scenario_cfg)  # Create a Scenario instance
scenario_1.submit()  # Submit the Scenario for execution

# Total runtime: 74.49s

Skip unnecessary task executions

One of Taipy’s most practical features is its ability to skip a task execution if its output is already computed.
Let’s explore this with some scenarios:

Changing Payment Threshold

# Changing Payment Threshold to 1600
scenario_1.payment_threshold.write(1600)
scenario_1.submit()

# Total runtime: 31.499s

What Happens: Taipy is intelligent enough to skip Task 1 because the payment threshold only affects Task 2.
In this case, we are seeing more than 50% reduction in execution time by running your pipeline with Taipy.

Changing Metric for Segment Analysis

# Changing metric to median
scenario_1.metric.write("median")
scenario_1.submit()

# Total runtime: 23.839s

What Happens: In this case, only Task 3 and Task 4 are affected. Taipy smartly skips Task 1 and Task 2.

Changing Summary Statistic Type

# Changing summary_statistic_type to max
scenario_1.summary_statistic_type.write("max")
scenario_1.submit()

# Total runtime: 5.084s


What Happens: Here, only Task 4 is affected, and Taipy executes only this task, skipping the rest.
Taipy’s smart task skipping is not just a time-saver; it’s a resource optimizer that becomes incredibly useful when dealing with large datasets.

7. Taipy Studio

You may use Taipy Studio to build the Taipy config.toml configuration file in place of defining the config.py script.

DAG inside Studio

First, install the Taipy Studio extension using the Extension Marketplace.

Creating the Configuration

  • Create a Config File: In VS Code, navigate to Taipy Studio, and initiate a new TOML configuration file by clicking the + button on the parameters window.


  • Then right-click on it and select Taipy: Show View.

Configuration show view

  • Adding entities to your Taipy Configurations:
    On the right-hand side of Taipy Studio, you should see a list of 3 icons that can be used to set up your pipeline.

Configuration icon

  1. The first item is for adding a Data Node. You can link any Python object to Taipy’s Data Nodes.
  2. The second item is for adding a Task. A Task can be linked to a predefined Python function.
  3. The third item is for adding a Scenario. Taipy allows you to have more than one Scenario in a configuration.

– Data Nodes

Input Data Node: Create a Data Node named “path_to_data”, then navigate to the Details tab, add a new property “default_data”, and paste “SMALL_amazon_customers_data.csv” as the path to your dataset.

Intermediate Data Nodes: We’ll need to add four more Data Nodes: “scored_df”, “segmented_customer_df”, “segment_result”, “high_value_summary_df”. With Taipy’s intelligent design, you don’t need to configure anything for these intermediate data nodes; the system handles them smartly.

Intermediate Data Nodes with Defaults: We finally define four more intermediate Data Nodes with the “default_data” property set to the following:

  • payment_threshold: “1000:int”

datanode view

  • score_threshold: “1.5:float”
  • metric: “mean”
  • summary_statistic_type: “median”

– Tasks

Clicking on the Add Task button, you can configure a new Task.
Add four Tasks, then link each Task to the appropriate function under the Details tab.
Taipy Studio will scan through your project folder and provide a categorized list of functions to choose from, sorted by the Python file.

Task 1 (preprocess_and_score): In Taipy Studio, you’d click the Task icon to add a new Task.
You’d specify the input as “path_to_data” and the output as “scored_df”.
Then, under the Details tab, you’d link this Task to the algos.algo.preprocess_and_score function.

Task Process and Score

Task 2 (featurization_and_segmentation): Similar to Task 1, you’d specify the inputs (”scored_df”, ”payment_threshold”, ”score_threshold”) and the output (”segmented_customer_df”). Link this Task to the algos.algo.featurization_and_segmentation function.

Task Featurization

Task 3 (segment_analysis): Inputs would be “segmented_customer_df” and “metric”, and the output would be “segment_result”.
Link to the algos.algo.segment_analysis function.

Task segment analysis

Task 4 (high_value_cust_summary_statistics): Inputs include “segment_result”, “segmented_customer_df”, and “summary_statistic_type”. The output is “high_value_summary_df”. Link to the algos.algo.high_value_cust_summary_statistics function.

Task Statistics

Conclusion

Taipy offers an intelligent way to build and manage data pipelines.
The skippable feature, in particular, makes it a powerful tool for optimizing computational resources and time, which is particularly beneficial in scenarios involving large datasets.
While Dask provides the raw power for data manipulation, Taipy adds a layer of intelligence, making your pipeline not just robust but also smart.

Additional Resources
For the complete code and TOML configuration, you can visit this GitHub repository. To dive deeper into Taipy, here’s the official documentation.

Once you understand Taipy Scenario management, you become much more efficient at building data-driven applications for your end users. Just focus on your algorithms, and Taipy handles the rest.


Hope you enjoyed this article!

Data Engineering For Beginners: A Step-By-Step Guide


For anyone who might be interested in embarking on data engineering, this article will serve as a stepping stone to explore the field further. Data engineering offers exciting opportunities to work with awesome technologies, solve complex data challenges, and contribute to the success of data-driven organizations.

By acquiring the necessary skills, staying up-to-date and gaining hands-on experience, you can embark on a rewarding career in data engineering.

Introduction

What is data engineering? — Data Engineering refers to designing, building, and maintaining the infrastructure and systems necessary for the collection, storage, processing, and analysis of large volumes of data.

Data engineers work closely with data scientists, analysts, and other stakeholders to create robust data pipelines and enable efficient data-driven decision-making.

Roles of a Data Engineer.

  1. Design and develop data pipelines that extract, transform, and load (ETL) data from various sources into a centralized storage system.
  2. Managing Data infrastructure required to store and process large volumes of data. This includes selecting and configuring databases, data warehouses, and data lakes, as well as optimizing their performance and scalability.
  3. Data modeling and database design: Data engineers work closely with data scientists and analysts to design data models and schemas that facilitate efficient data storage and retrieval.
  4. Monitoring and maintenance: Implementing data quality checks and validation processes to ensure the accuracy, consistency, and integrity of the data.

Key skills and knowledge required to become a Data Engineer:

If you are interested in becoming a data engineer, you need both technical skills and domain knowledge. Some of these skills include:

  1. Proficiency in programming languages like Python and SQL. A data engineer should be able to write efficient code to manipulate and process data and automate data workflows.
  2. Data storage and processing technologies: Data engineers should have an understanding of data storage and processing technologies such as relational databases (e.g., MySQL, PostgreSQL), distributed systems (e.g., Apache Hadoop, Apache Spark), and cloud-based platforms (e.g., AWS, Azure, GCP).
  3. ETL and data integration: Familiarity with ETL (Extract, Transform, Load) processes and tools for data integration is a must. Data engineers should have knowledge of data integration frameworks like Apache Airflow or commercial tools like Informatica.
  4. Data modeling and database design: Should have knowledge of data modeling techniques and database design principles to design efficient database schemas and optimize query performance.
  5. Big data technologies: Knowledge of big data technologies like Hadoop, Spark, and NoSQL databases is highly valuable due to the increase in the amount and complexity of data. Data engineers should be able to work with distributed computing frameworks and handle large-scale data processing.

Importance of data engineering in today’s data-driven world

In today’s data-driven world, data engineering plays the crucial role of enabling organizations to harness the power of data for insights and innovation. Here are some key reasons why data engineering is important:

  1. Data engineering helps organizations integrate data from various sources, such as databases, APIs, and external systems, into a unified and structured format.
  2. Data engineering ensures that the infrastructure and processes are designed to handle the increasing data demands (scaling), enabling faster data processing and analysis.
  3. Data engineering focuses on ensuring the quality and reliability of data by implementing data validation techniques for identifying and rectifying data inconsistencies, errors, and missing values.
  4. Data engineering involves implementing robust data governance and security measures to protect sensitive data and comply with regulations through access controls, encryption, data masking, and auditing mechanisms to safeguard data privacy and maintain data integrity.
  5. By building efficient data pipelines and systems, organizations can derive valuable insights from data in a timely manner. This enables stakeholders to make informed choices, identify trends, and uncover hidden patterns that can drive business growth and innovation.

An Overview of the data engineering process

  1. Collecting data.
  2. Cleansing the data
  3. Transforming the data.
  4. Processing the data.
  5. Monitoring.


Step-by-Step Guide to Data Engineering

Step 1: Defining data requirements
This involves understanding the business goals and objectives that drive the need for data analysis and decision-making. Here are two key aspects of this step:

  1. Identifying business goals and objectives

Data engineers collaborate with stakeholders in order to understand the organization’s goals and objectives. This includes identifying the business questions that need to be addressed, key performance indicators (KPIs) that need to be tracked, and the desired outcomes of data analysis. All this will ensure that the data infrastructure and processes are designed to support the organization’s specific objectives.

  2. Determining data sources and types

Data engineers work with stakeholders to determine the relevant data sources and types required to achieve the defined business goals. This will involve identifying both the internal (databases, data warehouses, or existing data lakes within the organization) and external ( APIs, third-party data providers, or publicly available datasets) data sources that contain the necessary information.

Data engineers also consider the types of the data, whether it is structured data (relational databases), semi-structured data (JSON or XML), or unstructured data (text documents or images).

Step 2: Data collection and ingestion
After defining the data requirements, the next step in the data engineering process is to collect and ingest the data into a storage system. This step involves these key activities:

  1. Extracting data from various sources

Data engineers utilize the appropriate techniques and tools to extract data from the identified data sources, which include databases, APIs, files or external data providers. This will involve querying databases, making API calls, or accessing files stored in different formats.

  2. Transforming and cleaning the data

After extracting the data, data engineers transform and clean the data to ensure its quality and compatibility with the target storage system. This involves techniques like data normalization, standardization, removing duplicates and handling missing or erroneous values. Data validation checks may also be done to ensure the integrity and consistency of the collected data.

  3. Loading the data into a storage system

Once data has been transformed and cleaned, data engineers load it into a storage system for further processing and analysis. The storage system of choice will depend on the organization’s requirements and may include relational databases, data warehouses, data lakes, or cloud-based storage solutions. Data engineers then design the appropriate schema to efficiently store and organize the data in the chosen storage system. A minimal sketch of this extract-transform-load flow is shown below.
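
Here is a minimal, hypothetical sketch of the three activities (extract, transform, load) using Pandas and SQLAlchemy; the connection strings, table names, and column names are placeholders rather than part of any real system:

### sketch: a tiny extract-transform-load flow (all names are placeholders)
import pandas as pd
from sqlalchemy import create_engine

# Extract: pull raw rows from an operational database
source = create_engine("postgresql://user:password@source-db:5432/shop")
orders = pd.read_sql("SELECT * FROM orders", source)

# Transform: clean and standardize the extracted data
orders = orders.drop_duplicates(subset="order_id")
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["order_id", "order_date"])
orders["amount"] = orders["amount"].fillna(0).round(2)

# Load: append the cleaned rows into an analytics warehouse table
warehouse = create_engine("postgresql://user:password@warehouse-db:5432/analytics")
orders.to_sql("fact_orders", warehouse, if_exists="append", index=False)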

Step 3: Data storage and management
The next step in the data engineering process is to effectively store and manage the data. This will involve the following:

  1. Choosing the appropriate storage system

A data engineer needs to evaluate the different storage systems and select the most appropriate for their particular organization. Factors like data volume, variety, velocity, scalability, performance and cost need to be considered before setting up the necessary infrastructure, defining data schemas and optimizing storage configurations. It is critical for a data engineer to ensure that the storage system chosen at this point is compatible with the data processing and analysis tools that will be used in the later steps.

  2. Implementing data governance and security measures

Data governance and security are critical aspects of data storage and management, and data engineers need to ensure data quality, consistency, and compliance with existing regulations. They also need to implement security measures to protect the data from unauthorized access, data breaches, and other security threats, using access controls, encryption mechanisms, data masking techniques, and auditing mechanisms to ensure data privacy and maintain data integrity.

Step 4: Data processing and transformation
Data processing frameworks provide the necessary tools and infrastructure to perform complex data processing tasks efficiently for example Apache Spark, which is designed for distributed data processing.
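
As a rough illustration of what such framework-based processing looks like, here is a short, hypothetical PySpark example; the storage paths and column names are invented for the sketch:

### sketch: distributed transformation and aggregation with Spark (hypothetical data)
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("processing-example").getOrCreate()

# Read a large, partitioned dataset from the data lake
events = spark.read.parquet("s3://datalake/clean/events/")

daily_revenue = (
    events
    .filter(F.col("event_type") == "purchase")          # transformation: keep only purchases
    .withColumn("event_date", F.to_date("event_time"))  # transformation: derive a date column
    .groupBy("event_date", "country")                   # aggregation: summarize per group
    .agg(F.count("*").alias("purchases"),
         F.sum("amount").alias("revenue"))
)

# Write the summarized result back for analysts to consume
daily_revenue.write.mode("overwrite").parquet("s3://datalake/marts/daily_revenue/")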

Once the data is stored and managed, the next step is to process and transform the data to derive meaningful insights and it will involve the following:

  1. Performing data transformation and aggregation

Data engineers need to convert the raw data into a format suitable for analysis; this involves cleaning the data, filtering it, merging data from different sources, and reshaping it to meet specified requirements. Data engineers also perform data aggregations to summarize and condense the data, enabling easier analysis and visualization. Transforming and aggregating the data will uncover patterns and trends within the data.

  2. Handling large-scale data processing

The amount of data keeps increasing and data engineers should know how to efficiently handle large-scale data processing. This involves optimizing data processing workflows, utilizing parallel processing techniques and using distributed computing frameworks. Effective handling of large-scale data processing ensures the insights derived from the data are obtained in a timely and efficient manner.

Step 5: Data quality and validation
Data quality involves ensuring the accuracy, consistency, and reliability of the data. The data quality and validation step involves the following:

  1. Ensuring data accuracy and consistency

Data engineers need to implement measures like performing data cleansing and data profiling techniques to identify and rectify any errors, inconsistencies or anomalies in the data. Data engineers also need to handle missing values, remove duplicates, and resolve data conflicts to improve accuracy and consistency of the data.

  2. Implementing data validation techniques

Data engineers implement various validation techniques to ensure that the data meets predefined standards and business rules. This involves performing data type checks, range checks, format checks, and referential integrity checks, which help identify and rectify data inconsistencies, errors, and missing values; a small example of such checks appears at the end of this step.

  3. Monitoring data quality over time

Data engineers need to establish mechanisms to monitor data quality over time to ensure that the data remains accurate, consistent, and reliable throughout its lifecycle. This involves setting up data quality metrics and implementing data quality monitoring tools and processes. Data engineers may set up automated data quality checks, create dashboards and have alerting mechanisms in place which will promptly identify and address any data quality issues.
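
As a small, hypothetical example of the kinds of checks described in this step (type, range, uniqueness, and completeness), using Pandas with made-up file and column names:

### sketch: simple data quality checks (hypothetical columns)
import pandas as pd

df = pd.read_csv("data/customers.csv")

report = {
    # Type check: purchase amounts must be numeric
    "non_numeric_amount": pd.to_numeric(df["total_purchase_amount"], errors="coerce").isna().sum(),
    # Range check: review scores must lie between 1 and 5
    "score_out_of_range": (~df["average_review_score"].between(1, 5)).sum(),
    # Uniqueness check: customer ids must not repeat
    "duplicate_customer_id": df["customer_id"].duplicated().sum(),
    # Completeness check: required fields must not be null
    "missing_required": df[["customer_id", "total_purchase_amount"]].isna().sum().sum(),
}
print(report)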

Step 6: Data integration and visualization
This involves combining data from various sources, creating pipelines and workflows, and visualizing the data in the form of dashboards and reports. The following are the steps involved:

  1. Integrating data from multiple sources

Data engineers work with various data sources and they design and implement data integration processes to extract data from these sources and transform it into a unified format. This may involve data mapping, data merging, and data cleansing techniques to ensure the data is consistent and ready for analysis.

  2. Creating data pipelines and workflows

Data engineers build data pipelines and workflows that automate the movement and processing of data. They design the flow of data from source to destination and incorporate data transformations, aggregations, and other processing steps. Data pipelines ensure that data is processed in an efficient and consistent way, enabling timely and accurate analysis. Workflow automation tools and frameworks like Apache Airflow are used to schedule and manage the data pipelines. A minimal Airflow sketch is shown after this list.

  3. Visualizing data for analysis and reporting

Data visualization is a tool for understanding and communicating insights from data. Data engineers collaborate with data analysts and data scientists to create visualizations to present the data and highlight key findings. Visualization tools include Tableau, Power BI or Python libraries like Matplotlib or Plotly to create interactive charts, graphs, and dashboards. The visualizations enable stakeholders to explore the data, identify patterns, and use the insights to make data-informed decisions.
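
As promised above, here is a minimal, hypothetical sketch of how such a pipeline could be scheduled with Apache Airflow (2.4 or later); the DAG id, schedule, and task functions are placeholders:

### sketch: a tiny Airflow DAG chaining extract -> transform -> load (placeholders)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the sources")       # placeholder

def transform():
    print("cleaning and aggregating the data")   # placeholder

def load():
    print("loading results into the warehouse")  # placeholder

with DAG(
    dag_id="customer_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run extraction first, then transformation, then loading
    extract_task >> transform_task >> load_task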

Wrapping Up!!

Data engineering is a critical field that empowers organizations to harness the full potential of their data. As a data engineer, you need to familiarize yourself with basics such as programming and data manipulation (ETL), know how to use visualization tools such as Tableau or Power BI, build pipelines, and understand how to structure data in a logical manner.

Hope you found this introduction to data engineering informative! If designing, building, and maintaining data systems at scale excites you, definitely give data engineering a go

Exploratory Data Analysis Using Data Visualization Techniques 📊

Data Visualization

Are you intrigued by the fascinating world of Data Science and eager to embark on a journey to unravel the hidden insights within data? If so, you’ve landed on the right path. Exploratory Data Analysis (EDA) is a critical phase in the data analysis process that involves the initial investigation of a dataset to summarize its main characteristics, often with the help of data visualization techniques.

EDA is like peeling the layers of an onion to reveal the hidden insights within the data. This article will guide you through the exciting realm of Exploratory Data Analysis (EDA) using Data Visualization Techniques, an essential step in the data science process.

Exploratory Data Analysis is a crucial step in the data analysis process. It allows data analysts and scientists to get a feel for the data, understand its characteristics, and generate hypotheses. Data visualization techniques are the tools that make EDA effective, providing insights that might otherwise remain hidden. In a data-driven world, mastering EDA is essential for making informed decisions and extracting valuable insights from your data

Data scientists serve as the bridge between raw, unprocessed data and valuable business insights. They have the unique skill set required to manipulate vast and seemingly meaningless datasets, extracting meaningful patterns and trends. This analysis, in turn, plays a crucial role in driving modern economies and assisting governments and organizations in addressing contemporary issues.

Data visualization techniques lie at the heart of this endeavor, helping data scientists and analysts make sense of the data and extract meaningful insights.

Understanding Exploratory Data Analysis

Exploratory Data Analysis, introduced by statistician John Tukey in the 1970s, is all about making sense of data without jumping to conclusions. It involves systematically examining data sets, summarizing their main characteristics, and creating visualizations to help understand the data’s structure, patterns, and anomalies.

Data visualization is at the heart of EDA. It’s the process of representing data graphically to uncover patterns, trends, and anomalies. The essential visualization techniques used in EDA are covered below, after a quick look at the EDA process itself.

The EDA Process

Data Collection: The EDA process begins with data collection. It’s essential to gather high-quality, clean data for meaningful analysis.

Data Cleaning: This step involves handling missing values, outliers, and inconsistencies in the data.

Univariate Analysis: In this stage, each variable is analyzed individually. This includes creating histograms, box plots, and summary statistics to understand their distribution.

Bivariate Analysis: Bivariate analysis explores relationships between pairs of variables. Scatter plots and correlation matrices are commonly used in this phase.

Multivariate Analysis: Multivariate analysis extends the exploration to multiple variables simultaneously. Techniques like heatmaps can be helpful.

Anomaly Detection: EDA often involves identifying and addressing outliers and anomalies in the data.
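To make these steps concrete, here is a minimal pandas sketch of the first few stages, assuming a hypothetical employees.csv file with numeric columns (the file and column names are placeholders):

import pandas as pd

# Data collection: load a (hypothetical) dataset
df = pd.read_csv("employees.csv")

# Data cleaning: inspect missing values and drop duplicate rows
print(df.isnull().sum())
df = df.drop_duplicates()

# Univariate analysis: summary statistics for each numeric column
print(df.describe())

# Bivariate analysis: correlations between pairs of numeric variables
print(df.corr(numeric_only=True))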

Data Visualization Techniques 📉

Data visualization involves creating graphical representations of data, making it easier for humans to understand and interpret. Here are some essential data visualization techniques and their roles in EDA:

  1. Scatter Plots
    Scatter plots are effective for visualizing the relationship between two continuous variables. They help identify patterns such as clusters, outliers, and trends. For instance, scatter plots can reveal whether there’s a correlation between a person’s age and income.


  2. Histograms and Density Plots
    Histograms provide a visual representation of the distribution of a single variable. They can indicate whether the data follows a normal distribution or if it’s skewed. Density plots offer a smoothed version of histograms, making it easier to see underlying patterns.


  3. Box Plots
    Box plots display the distribution of a dataset, showing the median, quartiles, and potential outliers. They are excellent for comparing distributions between different groups or categories. For instance, box plots can help you compare the salaries of employees in different departments of a company.


  4. Heatmaps
    Heatmaps are valuable for exploring relationships between multiple variables. They visualize the correlation between variables in a matrix form, making it evident which variables are strongly related and which are not.


  5. Time Series Plots
    Time series plots are ideal for visualizing data collected over time, such as stock prices, temperature, or website traffic. They help in identifying trends, seasonality, and anomalies.


  6. Bar Charts
    Bar charts are useful for displaying categorical data. They’re often used for comparing the frequencies or proportions of different categories. For instance, a bar chart can illustrate the market share of different smartphone brands.

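To see these techniques in action, here is a small sketch using seaborn and matplotlib on the sample "tips" dataset that ships with seaborn; any DataFrame of your own would work the same way:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")  # small sample dataset bundled with seaborn

# Scatter plot: relationship between two continuous variables
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.show()

# Histogram with a density curve: distribution of a single variable
sns.histplot(data=tips, x="total_bill", kde=True)
plt.show()

# Box plot: compare a distribution across categories
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()

# Heatmap: correlation matrix of the numeric columns
sns.heatmap(tips.corr(numeric_only=True), annot=True)
plt.show()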

Data Visualization: Bringing Data to Life📈

Data visualization is a pivotal aspect of data science. After performing various data operations, the ability to convey insights through visualizations is essential for effective communication. Here are some valuable resources to help you master this skill:

  • Tableau: Tableau is a powerful data visualization tool that allows you to create visually appealing and easy-to-understand charts, graphs, and dashboards. Its user-friendly interface makes it an ideal choice for data analysts and scientists.

  • D3.js: D3.js is a JavaScript library that empowers you to create interactive and dynamic data visualizations. It’s a popular choice for crafting custom and captivating visualizations.

  • Seaborn and Plotly: These Python libraries are handy for creating engaging and informative visualizations. Seaborn is known for its beautiful statistical plots, while Plotly enables you to build interactive charts.

  • Power BI: Power BI, short for Power Business Intelligence, is a robust business analytics service and data visualization tool developed by Microsoft. It empowers organizations and individuals to analyze data, share insights, and make data-driven decisions.

    Power BI is a suite of software services, applications, and connectors that work together to transform raw data into visually appealing and interactive reports and dashboards.

Building Your Data Science Portfolio

As you progress in your data science journey, start building your portfolio. Creating projects and writing articles about your data analysis experiences will set you apart. Consider using platforms like GitHub to showcase your work. Kaggle is another valuable resource, providing access to extensive datasets and a community of fellow data scientists.

The post Exploratory Data Analysis Using Data Visualization Techniques 📊. appeared first on ProdSens.live.

A Beginner’s Guide to Building LLM-Powered Applications with LangChain! https://prodsens.live/2023/08/30/a-beginners-guide-to-building-llm-powered-applications-with-langchain/ Wed, 30 Aug 2023 05:25:50 +0000


If you’re a developer or simply someone passionate about technology, you’ve likely encountered AI tools such as ChatGPT. These utilities are powered by advanced large language models (LLMs). Interested in taking it up a notch by crafting your own LLM-based applications? If so, LangChain is the platform for you.

Let’s set everything else aside and understand LLMs first. Then we can go over LangChain with a simple tutorial. Sounds interesting? Let’s get going.

What are Large Language Models (LLMs)?

Large Language Models (LLMs) like GPT-3 and GPT-4 from OpenAI are machine learning algorithms designed to understand and generate human-like text based on the data they’ve been trained on. These models are built using neural networks with millions or even billions of parameters, making them capable of complex tasks such as translation, summarization, question-answering, and even creative writing.

Trained on diverse and extensive datasets, often encompassing parts of the internet, books, and other texts, LLMs analyze the patterns and relationships between words and phrases to generate coherent and contextually relevant output. While they can perform a wide range of linguistic tasks, they are not conscious and don’t possess understanding or emotions, despite their ability to mimic such qualities in the text they generate.

(Image: how LLMs work. Source credits: NVIDIA)

Large language models primarily belong to a category of deep learning structures known as transformer networks. A transformer model is a type of neural network that gains an understanding of context and significance by identifying the connections between elements in a sequence, such as the words in a given sentence.

What is LangChain?

Developed by Harrison Chase and debuted in October 2022, LangChain is an open-source framework for building sturdy applications powered by Large Language Models, such as chatbots like ChatGPT and various tailor-made applications.

LangChain seeks to equip data engineers with an all-encompassing toolkit for utilizing LLMs in diverse use cases, such as chatbots, automated question-answering, text summarization, and beyond.

LangChain is composed of 6 modules explained below:

(Image: LangChain modules explained. Image credits: ByteByteGo)

  • Large Language Models:
    LangChain serves as a standard interface that allows for interactions with a wide range of Large Language Models (LLMs).

  • Prompt Construction:
    LangChain offers a variety of classes and functions designed to simplify the process of creating and handling prompts.

  • Conversational Memory:
    LangChain incorporates memory modules that enable the management and alteration of past chat conversations, a key feature for chatbots that need to recall previous interactions.

  • Intelligent Agents:
    LangChain equips agents with a comprehensive toolkit. These agents can choose which tools to utilize based on user input.

  • Indexes:
    Indexes in LangChain are methods for organizing documents in a manner that facilitates effective interaction with LLMs.

  • Chains:
    While using a single LLM may be sufficient for simpler tasks, LangChain provides a standard interface and some commonly used implementations for chaining LLMs together for more complex applications, either among themselves or with other specialized modules.

How Does LangChain Work?


LangChain takes large amounts of data and breaks it down into smaller chunks that can easily be embedded into a vector store. Then, with the help of LLMs, we can retrieve only the information that is needed.

When a user submits a prompt, LangChain queries the vector store for relevant information. When an exact or near match is found, that information is fed to the LLM to complete or generate the answer the user is looking for.
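As a rough sketch of that flow, assuming the langchain 0.0.x imports used in this tutorial, a FAISS index (which needs the faiss-cpu package installed), and a placeholder variable long_document_text holding your own raw text:

from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# Break the raw text into smaller, overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(long_document_text)  # long_document_text is a placeholder for your data

# Embed the chunks and store them in a vector store
vector_store = FAISS.from_texts(chunks, OpenAIEmbeddings())

# At question time, retrieve only the chunks relevant to the user's prompt
relevant_docs = vector_store.similarity_search("user question here", k=3)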

Get Started with LangChain

Let’s use SingleStore’s Notebooks feature (it is FREE to use) as our development environment for this tutorial.


The SingleStore Notebook extends the capabilities of Jupyter Notebook to enable data professionals to easily work and play around.

What is SingleStore?

SingleStore is a distributed, in-memory, SQL database management system designed for high-performance, high-velocity applications. It offers real-time analytics and mixes the capabilities of a traditional operational database with that of an analytical database to allow for transactions and analytics to be performed in a single system.

Sign up for SingleStore to use the Notebooks.

SingleStore Notebooks feature

Once you sign up to SingleStore, you will also receive $600 worth of free computing resources, so why not use this opportunity?

Click on ‘Notebooks’ and start with a blank Notebook.

Name it something like ‘LangChain-Tutorial’ or as per your wish.

Let’s start working with the Notebook we just created.
Follow this step-by-step guide: add the code shown in each step to your Notebook and execute it. Let’s start!

Now, to use LangChain, let’s first install it with the pip command.

!pip install -q langchain

To work with LangChain, you need integrations with one or more model providers like OpenAI or Hugging Face. In this example, let’s leverage OpenAI’s APIs, so let’s install it.

!pip install -q openai


Next, we need to set up the environment variable to play around.
Let’s do that.

import os
os.environ["OPENAI_API_KEY"] = "Your-API-Key"


Hope you know how to get your API key; if not, go to this link to get your OpenAI API key.

[Note: Make sure you still have the quota to use your API Key]

Next, let’s instantiate an LLM from OpenAI and make a prediction with this model.

Let’s ask our model the top 5 most populated cities in the world.

from langchain.llms import OpenAI
llm = OpenAI(temperature=0.7)
text = "what are the 5 most populated cities in the world?"
print(llm(text))


As you can see, our model made a prediction and printed the 5 most populated cities in the world.

Prompt Templates

Let’s first define the prompt template.

from langchain.prompts import PromptTemplate

# Creating a prompt
prompt = PromptTemplate(
    input_variables=["input"],
    template="what are the 5 most {input} cities in the world?",
)


We created our prompt. To get a prediction, let’s now call the format method and pass it an input.

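The call itself looks roughly like this (the input value "populated" is just an example):

# Fill the template and send the formatted prompt to the LLM defined earlier
formatted = prompt.format(input="populated")
print(llm(formatted))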

Creating Chains

So far, we’ve seen how to initialize an LLM and how to get a prediction from it. Now, let’s take a step forward and chain these steps together using the LLMChain class.

from langchain.chains import LLMChain
# Instancing a LLM model
llm = OpenAI(temperature=0.7)
# Creating a prompt
prompt = PromptTemplate(
  input_variables=["attribute"],
  template= "What is the largest {attribute} in the world?",
)

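The chain construction itself appeared only in a screenshot; a minimal sketch of that missing step, using "ocean" as an example input for the {attribute} placeholder, looks like this:

# Chain the prompt and the LLM together, then run it with an input
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("ocean"))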

You can see the prediction of the model.

Developing an Application Using LangChain LLM

Again use SingleStore’s Notebooks as the development environment.

Let’s develop a very simple chat application.

Start with a blank Notebook and name it as per your wish.

  • First, install the dependencies.
pip install langchain openai


  • Next, import the installed dependencies.
from langchain import ConversationChain, OpenAI, PromptTemplate, LLMChain

from langchain.memory import ConversationBufferWindowMemory


  • Get your OpenAI API key and save it safely.

  • Add and customize the LLM template

# Customize the LLM template 
template = """Assistant is a large language model trained by OpenAI.

{history}
Human: {human_input}
Assistant:"""

prompt = PromptTemplate(input_variables=["history", "human_input"], template=template)


  • Load the ChatGPT chain with the API key you saved earlier. Add the human input as ‘What is SingleStore?’. You can change the input to whatever you want.
chatgpt_chain = LLMChain(
    llm=OpenAI(openai_api_key="YOUR-API-KEY", temperature=0),
    prompt=prompt,
    verbose=True,
    memory=ConversationBufferWindowMemory(k=2),
)
# Predict a sentence using the chatgpt chain
output = chatgpt_chain.predict(human_input="What is SingleStore?")
# Display the model's response
print(output)

The script initializes the LLM chain using the OpenAI API key and a preset prompt. It then takes user input and shows the resulting output.

The expected output is the assistant’s response to the question, describing what SingleStore is.

Play with this by changing the human input text/content.

The complete execution steps, in code format, are available on GitHub.

LangChain emerges as an indispensable framework for data engineers and developers striving to build cutting-edge applications powered by large language models. Unlike traditional tools and platforms, LangChain offers a more robust and versatile framework tailored for complex AI applications. LangChain is not just another tool in a developer’s arsenal; it’s a transformative framework that redefines what is possible in the realm of AI-powered applications.

In the realm of GenAI applications and LLMs, it is highly recommended to know about vector databases. I recently wrote a complete overview of vector databases; you might like to go through that article.

Note: There is much more to learning about LLMs and LangChain, and this is my first attempt at writing about this framework. This is not a complete guide to LangChain. Please go through more articles and tutorials to understand LangChain in depth.

Don’t forget to sign up for SingleStore to use the free Notebooks feature. Play around and have fun learning.

Disclaimer: ChatGPT assisted with only some sections of this article.

The post A Beginner’s Guide to Building LLM-Powered Applications with LangChain! appeared first on ProdSens.live.

Building ETL/ELT Pipelines For Data Engineers. https://prodsens.live/2023/08/22/building-etl-elt-pipelines-for-data-engineers/ Tue, 22 Aug 2023 03:25:43 +0000


Introduction:

When it comes to processing data for analytical purposes, ETL (Extraction, Transformation, Load) and ELT (Extract, Load, Transform) pipelines play a pivotal role. In this article, we will delve into the definitions of these two processes, explore their respective use cases, and provide recommendations on which to employ based on different scenarios.

Defining ETL and ELT:

ETL, which stands for Extraction, Transformation, and Load, involves the extraction of data from various sources, transforming it to meet specific requirements, and then loading it into a target destination. On the other hand, ELT, or Extract, Load, Transform, encompasses the extraction of raw data, loading it into a target system, and subsequently transforming it as needed. Both ETL and ELT serve the purpose of preparing data for advanced analytics.
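As a toy illustration of the ETL idea in Python, assuming a hypothetical raw_orders.csv source (with order_id, quantity, and unit_price columns) and a local SQLite file standing in for the target warehouse:

import pandas as pd
import sqlite3

# Extract: read raw data from a source file
raw = pd.read_csv("raw_orders.csv")

# Transform: clean and enrich the data before loading
transformed = (
    raw.dropna(subset=["order_id"])
       .assign(order_total=lambda df: df["quantity"] * df["unit_price"])
)

# Load: write the refined data into the target system
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("orders", conn, if_exists="replace", index=False)

In an ELT pipeline, by contrast, the raw data would be loaded first and the transformation would run inside the warehouse itself, for example with SQL or a tool like DBT.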

ETL Process Utilization:

ETL pipelines are often used in situations involving legacy systems, where data engineers respond to ad hoc business requests and intricate data transformations. This process ensures that data is refined before being loaded into the target system, enhancing its quality and relevance.

ELT Process Preference:

ELT pipelines have gained preference due to their swifter execution compared to ETL pipelines. Additionally, the setup costs associated with ELT are lower, as analytics teams do not need to be involved from the outset, unlike ETL where their early engagement is required. Furthermore, ELT benefits from the security measures built into the data warehouse itself, whereas ETL requires engineers to layer on additional security measures.


Exploring a Practical ETL Pipeline Project:

To gain hands-on experience in Data Engineering, cloud technologies, and data warehousing, consider working on the project “Build an ETL Pipeline with DBT, Snowflake, and Airflow.” This project provides a solid foundation and equips you with valuable skills that stand out in the field. The tools employed in this project are:

  • DBT (Data Build Tool) for ETL processing.
  • Airflow for orchestration, building upon knowledge from a previous article.
  • Snowflake as the data warehouse.

Conclusion

In conclusion, we have gained a comprehensive understanding of ETL and ELT pipelines, including their distinctive use cases. By considering the scenarios outlined here, you can make informed decisions about which approach suits your data processing needs. As a recommendation, engaging in the mentioned project will undoubtedly enhance your expertise in Data Engineering, cloud technologies, and data warehousing. Embark on this journey to stand out and continue your learning in this dynamic field.

Happy learning!

The post Building ETL/ELT Pipelines For Data Engineers. appeared first on ProdSens.live.
