dataengineering Archives - ProdSens.live

Different file formats, a benchmark doing basic operations

Recently, I’ve been designing a data lake to store different types of data from various sources, catering to diverse demands across different areas and levels. To determine the best file type for storing this data, I compiled points of interest, considering the needs and demands of different areas. These points include:

Tool Compatibility

Tool compatibility refers to which tools can write and read a specific file type. No/low code tools are crucial, especially when tools like Excel/LibreOffice play a significant role in operational layers where collaborators may have less technical knowledge to use other tools.

Storage

How much extra or less space will a particular file type cost in the data lake? While non-volatile memory is relatively cheap nowadays, both on-premise and in the cloud, with a large volume of data, any savings and storage optimization can make a difference in the final balance.

Reading

How long do the tools that will consume the data take to open and read the file? In applications where reading seconds matter, sacrificing compatibility and storage for gains in processing time becomes crucial in the data pipeline architecture planning.

Writing

How long will the tools used by our data team take to generate the file in the data lake? If immediate file availability is a priority, this is an attribute we would like to minimize as much as possible.

Query

Some services will directly consume data from the file and perform grouping and filtering functions. Therefore, it’s essential to consider how much time these operations will take to make the correct choice in our data solution.

Benchmark

Files

1 – IBM Transactions for Anti Money Laundering (AML)

Rows: 31 million
Columns: 11
Types: Timestamp, String, Integer, Digital, Boolean

2 – Malware Detection in Network Traffic Data

Rows: 6 million
Columns: 23
Types: String, Integer

Number of Tests

15 tests were conducted for each operation on each file, and the results in the graphs represent the average of each test iteration’s results. The only variable unaffected by the number of tests is the file size, which remains the same regardless of how many times it is written.

Why 2 datasets?

I chose two completely different datasets. The first is significantly larger than the second, has few columns with little data variability, and contains more complex types such as timestamps. The second, in contrast, contains many null values represented by “-” and many columns with duplicate values where the distinction between values is low. These characteristics highlight the distinctions, strengths, and weaknesses of each format.

Script

The script used for benchmarking is open on GitHub for anyone who wants to check or conduct their benchmarks with their files, which I strongly recommend.

file-format-benchmark: benchmark script of key operations between different file formats

Tools

I will use Python with Spark for the benchmark. Spark allows native queries on different file types, unlike Pandas, which requires an extra library to achieve this. Additionally, Spark is more performant on larger datasets, and the datasets used in this benchmark are large enough that Pandas struggled with them.

Env:
Python version: 3.11.7
Spark version: 3.5.0
Hadoop version: 3.4.1

Benchmark Results

Tool Compatibility

Although I wanted to measure tool compatibility, I couldn’t find a good way to quantify it, so I’ll share my opinion. For pipelines with downstream stakeholders who have more technical knowledge (data scientists, machine learning engineers, etc.), the file format matters little: with a library or framework in any programming language, you can manipulate information from a file in any format. However, for non-technical stakeholders like business analysts, C-level executives, or other collaborators who work directly with product/service production, the scenario changes. These individuals often use tools like Excel, LibreOffice, Power BI, or Tableau (which, despite having more native readers, do not support Avro or ORC).

In cases where files are consumed “manually” by people, you will almost always opt for CSV or JSON. These formats, being plain text, can be opened, read, and understood in any text editor, and practically every tool can read structured data in them. Parquet still has reasonable compatibility, being the columnar format with the most support and attention from the community. On the other hand, ORC and Avro have very little support, and it can be challenging to find parsers and serializers for them in non-Apache tools.

In summary, CSV and JSON have a significant advantage over the others, and you will likely choose them when your stakeholders are directly handling the files and lack technical knowledge.

Storage

Dataset 1:
Storage results graph

Dataset 2:
Storage results graph
To calculate storage, we loaded the dataset in CSV format, rewrote it in all formats (including CSV itself), and listed the amount of space they occupy.
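
The full benchmark script is on GitHub (linked above); the following is only a minimal sketch of the idea in PySpark, with illustrative paths: read the CSV once, rewrite it in each format, and sum the size of every file Spark produces for that format. Note that writing Avro requires the external spark-avro package on the classpath.

### sketch (not the benchmark script): measuring storage per format
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-benchmark").getOrCreate()
df = spark.read.csv("data/dataset1.csv", header=True, inferSchema=True)

def dir_size_mb(path: str) -> float:
    # Spark writes a directory of part files, so sum every file inside it
    total = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )
    return total / 1024 ** 2

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    out = f"output/dataset1_{fmt}"
    df.write.format(fmt).mode("overwrite").save(out)
    print(fmt, round(dir_size_mb(out), 1), "MB")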

The graphs show a significant disadvantage for JSON, which was three times larger than the second-largest file (CSV) in both cases. The difference is so pronounced due to the way JSON is written: a list of objects with key-value pairs, where the key is the column name and the value is that column’s value in the tuple. This results in unnecessary schema redundancy, repeating the column name in every record. Since both are plain text without any compression, CSV and JSON show the two worst performances in terms of storage. Parquet, ORC, and Avro had very similar results, highlighting their storage efficiency compared to the more common types. The key reasons for this advantage are that Parquet, ORC, and Avro are binary formats, and Parquet and ORC are also columnar formats that significantly reduce data redundancy, avoiding waste and optimizing space. All three formats have highly efficient compression methods.

In summary, CSV and JSON are by no means the best choices for storage optimization, especially in cases like storing logs or data that has no immediate importance but cannot be discarded.

Reading

Dataset 1:
Reading results graph

Dataset 2:
Reading results graph
In the reading operation, we timed the dataset loading and printed the first 5 rows.
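
As a rough illustration (not the exact benchmark code), the timing for this operation can be sketched like this in PySpark; the paths are illustrative, and CSV would additionally need header/schema options:

### sketch (not the benchmark script): timing the read of each format
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-benchmark").getOrCreate()

def time_read(fmt: str, path: str) -> float:
    start = time.perf_counter()
    df = spark.read.format(fmt).load(path)
    df.show(5)  # forces Spark to actually read data, not just plan the job
    return time.perf_counter() - start

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    seconds = time_read(fmt, f"output/dataset1_{fmt}")
    print(f"{fmt}: {seconds:.2f}s")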

In reading, there is a peculiar case: despite the large differences in file size (up to 3x), the only format with a visible and relevant difference was JSON. This occurs solely due to the way JSON is written, which makes it costly for the Spark parser to work with that amount of redundant schema metadata, and reading time grows quickly as the file grows. As for why CSV performed as well as ORC and Parquet: CSV is extremely simple, lacking metadata like a schema with types or field names, so it is quick for the Spark parser to read, split, and infer the column types of a CSV file. ORC and, especially, Parquet carry a large amount of metadata that is useful for files with more fields, complex types, and larger volumes of data. The difference between Avro, Parquet, and ORC is minimal and varies depending on the state of the cluster/machine, simultaneous tasks, and the data file layout. For these datasets, the reading differences are hard to evaluate; they become more evident when scaling these files to several times the size of the datasets we are working with.

In summary, CSV, Parquet, ORC, and Avro had almost no difference in reading performance, while JSON cannot be considered as an option in cases where fast data reading is required. Few cases prioritize reading alone; it is generally evaluated along with another task like a query. If you are looking for the most performant file type for this operation, you should consider conducting your own tests.

Writing

Dataset 1:
Write results graph

Dataset 2:
Write results graph
In the writing operation, we read a .csv file and rewrote it in the respective format, only counting the writing time.
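
A minimal sketch of that measurement, again with illustrative paths: the source CSV is read and cached up front so that only the rewrite into each format is timed.

### sketch (not the benchmark script): timing the write of each format
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-benchmark").getOrCreate()
df = spark.read.csv("data/dataset1.csv", header=True, inferSchema=True).cache()
df.count()  # materialize the cache so the initial read is not counted in the write time

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    start = time.perf_counter()
    df.write.format(fmt).mode("overwrite").save(f"output/write_test_{fmt}")
    print(f"{fmt}: {time.perf_counter() - start:.2f}s")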

In writing, there was a surprise: JSON was not the slowest format in the first dataset; ORC was. In the second dataset, however, JSON took the longest. This discrepancy is due to the second dataset having more columns, meaning more metadata to be written. ORC is a binary format with static typing, similar to Parquet, but it applies “better” optimization and compression techniques, which require more processing power and time. This explains its query times (which we will see next) and its file sizes, which are almost always smaller than the equivalent Parquet files. CSV performed well because it is a very simple format, lacking additional metadata such as a schema with types, or redundant metadata like JSON’s. At a larger scale, the more complex formats would outperform CSV. Avro also has its benefits and had a very positive result in dataset 1, outperforming Parquet and ORC by a significant margin. This probably happened because the data layout favors Avro’s optimizations, which differ from those of Parquet and ORC.

In summary, Avro, despite not being a format with much fame or community support, is a good choice in situations where you want the quick availability of your files for stakeholders to consume. It starts making a difference when scaling to several GBs of data, where the difference becomes 20-30 minutes instead of 30-40 seconds.

Query

Dataset 1:
Query results graph

Dataset 2:
Query results graph
In the query operation, the dataset was loaded, and a query with only one WHERE clause filtering a unique value was performed, followed by printing the respective tuple.
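
Sketched in PySpark (the column name and filter value below are made up; the actual query is in the benchmark script linked above):

### sketch (not the benchmark script): timing a single-filter query per format
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-benchmark").getOrCreate()

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    df = spark.read.format(fmt).load(f"output/dataset1_{fmt}")
    start = time.perf_counter()
    df.filter(df["transaction_id"] == "some-unique-value").show()  # hypothetical column and value
    print(f"{fmt}: {time.perf_counter() - start:.2f}s")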

In the first dataset, all formats had good performance, and the graph scales give the impression that Parquet performed poorly; in reality, the differences are minimal. Since dataset 2 is much smaller, we believe its query results are very susceptible to external factors, so we will focus on explaining the results for the first dataset.

As mentioned earlier, ORC performs well, comparable even to Avro, which had excellent performance in other operations. Still, Parquet leads this ranking with the fastest query result. Why? Parquet being the default format for Spark says a lot about how the framework works with this format. It incorporates various query optimization techniques, many of them consolidated in DBMSs. One of the most famous is predicate pushdown, which pushes WHERE-clause filters down to the scan so that less data is read and examined on disk. This is an optimization not present in ORC.

Why do CSV and JSON lag so far behind? In this case, CSV and JSON are not the problem; the truth is that Parquet and ORC are very well optimized. All the benefits mentioned earlier, such as schema metadata, binary encoding, and columnar layout, give them a significant advantage. And where does Avro fit into this, since it shares many of these benefits? In terms of query optimization, Avro lags far behind ORC and Parquet. One point worth mentioning is column projection, which reads only the specific columns used in the query rather than the entire dataset; this is present in ORC and Parquet but not in Avro. Logically, this is not the only thing that separates ORC and Parquet from Avro, but overall, Avro falls far behind in query optimization.

In summary, when working with large files, with both simple and complex queries, you will want to work with Parquet or ORC. Both have many query optimizations that will deliver results much faster compared to other formats. This difference is already evident in files slightly smaller than dataset 1 and becomes even more apparent in larger files.

Conclusion

In a data engineering environment where you need to serve various stakeholders, consume from various sources, and store data in different storage systems, operations such as reading, writing, and querying are strongly affected by the file format. Here, we saw the issues certain formats may have, evaluating the main points that come up when building data environments.

Parquet may be the “darling” of the community, and we highlighted some of its strengths, such as query performance, but we also showed that there are better options for certain scenarios, such as ORC for storage optimization.

The performance of these operations for each format also depends heavily on the tool you are using and how you are using it (environment and available resources). The results from Spark probably will not differ much from those of other robust frameworks like DuckDB or Flink, but we recommend that you conduct your own tests before making any decision that will have a significant impact on other areas of the business.

Introduction to Data Science

Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It involves the use of statistics, data analysis, machine learning, and related methods to understand and analyze actual phenomena with data.

The Evolution of Data Science

Data Science has evolved from statistics and data analysis over the years. With the advent of computers and an increase in data generation, the need for data processing and analysis grew. This evolution led to the development of more sophisticated data analysis methods and the emergence of machine learning and artificial intelligence as key components of data science.

Key Components of Data Science

Data Mining
Data mining involves exploring and analyzing large blocks of information to glean meaningful patterns and trends. It can involve the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.

Machine Learning
Machine Learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves.

Big Data
Big Data refers to data that is so large, fast, or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time, but the concept of big data gained momentum in the early 2000s.

Statistical Analysis
This refers to the collection, analysis, interpretation, presentation, and organization of data. Statistical analysis can be used in a wide range of fields, including social sciences, business, and engineering.

How Does Data Science Work?

Ask a Question: It all starts with a curious question about something we want to know.
Gather Information: Just like collecting clues, we gather all the data (information) we need.
Clean the Mess: We tidy up our data, sorting it out so it’s easier to use.
Start Detective Work: This is where we explore our data to find interesting trends or patterns.
Create a Data Model: Think of this like a mini-experiment to test our guesses about the data.
Check the Results: We see if our mini-experiment worked well.
Use Your Findings: Finally, we use what we learned to make decisions or solve problems.

Applications of Data Science

Data science has a wide range of applications including business intelligence, health care, finance, forecasting, image and speech recognition, and many more. It is used to predict customer behavior, enhance business operations, forecast trends, and make informed decisions.

Conclusion

The field of Data Science is continuously evolving as technology advances. It is becoming increasingly important in various sectors for making more informed and accurate decisions. As data continues to grow in volume and complexity, the role of data scientists is becoming more pivotal in interpreting the data for successful business outcomes.

Big data models 📊 vs. Computer memory 💾

Data pipelines are the backbone of any data-intensive project. As datasets grow beyond memory size (“out-of-core”), handling them efficiently becomes challenging.
Dask enables effortless management of large datasets (out-of-core), offering great compatibility with Numpy and Pandas.

Pipelines

This article focuses on the seamless integration of Dask (for handling out-of-core data) with Taipy, a Python library used for pipeline orchestration and scenario management.

Taipy – Your web application builder

A little bit about us: Taipy is an open-source library designed for easy development of both front ends (GUI) and ML/data pipelines.
No other knowledge is required (no CSS, no nothing!).
It has been designed to expedite application development, from initial prototypes to production-ready applications.


Star ⭐ the Taipy repository

We’re almost at 1000 stars and couldn’t do this without you🙏

1. Sample Application

Integrating Dask and Taipy is demonstrated best with an example. In this article, we’ll consider a data workflow with 4 tasks:

  • Data Preprocessing and Customer Scoring
    Read and process a large dataset using Dask, then score customers based on purchase behavior.

  • Feature Engineering and Segmentation
    Engineer additional features and segment customers into different categories based on these scores and other factors.

  • Segment Analysis
    Analyze each customer segment to derive group-wise insights.

  • Summary Statistics for High-Value Customers
    Compute summary statistics for the high-value customer segment.

We will explore the code of these 4 tasks in finer detail.
Note that this code is your Python code and is not using Taipy.
In a later section, we will show how you can use Taipy to model your existing data applications, and reap the benefits of its workflow orchestration with little effort.

The application will comprise the following 5 files:

algos/
├─ algo.py  #  Our existing code with 4 tasks
data/
├─ SMALL_amazon_customers_data.csv  #  A sample dataset
app.ipynb  # Jupyter Notebook for running our sample data application
config.py  # Taipy configuration which models our data workflow
config.toml  # (Optional) Taipy configuration in TOML made using Taipy Studio

2. Introducing Taipy – A Comprehensive Solution

Taipy is more than just another orchestration tool.
Especially designed for ML engineers, data scientists, and Python developers, Taipy brings several essential and simple features.
Here are some key elements that make Taipy a compelling choice:

  1. Pipeline execution registry
    This feature enables developers and end-users to:

    • Register each pipeline execution as a “Scenario” (a graph of tasks and data nodes);
    • Precisely trace the lineage of each pipeline execution; and
    • Compare scenarios with ease, monitor KPIs and provide invaluable insight for troubleshooting and fine-tuning parameters.
  2. Pipeline versioning
    Taipy’s robust scenario management enables you to adapt your pipelines to evolving project demands effortlessly.

  3. Smart task orchestration
    Taipy allows the developer to model the network of tasks and data sources easily.
    This feature provides a built-in control over the execution of your tasks with:

    • Parallel execution of your tasks; and
    • Task “skipping”, i.e., choosing which tasks to execute and
      which to bypass.
  4. Modular approach to task orchestration
    Modularity isn’t just a buzzword with Taipy; it’s a core principle.
    You can set up tasks and data sources that can be used interchangeably, resulting in a cleaner, more maintainable codebase.

3. Introducing Dask

Dask is a popular Python package for distributed computing. The Dask API implements the familiar Pandas, Numpy and Scikit-learn APIs, which makes learning and using Dask much more pleasant for the many data scientists who are already familiar with these APIs.
If you’re new to Dask, check out the excellent 10-minute Introduction to Dask by the Dask team.
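
To give a flavour of the API, here is a tiny, hypothetical example; the file pattern and the "Country" column are made up, while "TotalPurchaseAmount" matches the dataset used later in this article.

### a minimal Dask illustration (hypothetical file pattern and columns)
import dask.dataframe as dd

# Lazily point at one or many CSV files; nothing is loaded into memory yet
df = dd.read_csv("data/customers-*.csv")

# Pandas-style filtering and aggregation, still lazy
high_spenders = df[df["TotalPurchaseAmount"] > 1000]
avg_by_country = high_spenders.groupby("Country")["TotalPurchaseAmount"].mean()

# .compute() runs the chunked, parallel computation and returns a Pandas object
print(avg_by_country.compute())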

4. Application: Customer Analysis (algos/algo.py)

DAG schema
A graph of our 4 tasks (visualized in Taipy) which we will model in the next section.

Our existing code (without Taipy) comprises 4 functions, which you can also see in the graph above:

  • Task 1: preprocess_and_score
  • Task 2: featurization_and_segmentation
  • Task 3: segment_analysis
  • Task 4: high_value_cust_summary_statistics

You can skim through the following algos/algo.py script which defines the 4 functions and then continue reading on for a brief description of what each function does:

### algos/algo.py
import time

import dask.dataframe as dd
import pandas as pd

def preprocess_and_score(path_to_original_data: str):
    print("__________________________________________________________")
    print("1. TASK 1: DATA PREPROCESSING AND CUSTOMER SCORING ...")
    start_time = time.perf_counter()  # Start the timer

    # Step 1: Read data using Dask
    df = dd.read_csv(path_to_original_data)

    # Step 2: Simplify the customer scoring formula
    df["CUSTOMER_SCORE"] = (
        0.5 * df["TotalPurchaseAmount"] / 1000 + 0.3 * df["NumberOfPurchases"] / 10 + 0.2 * df["AverageReviewScore"]
    )

    # Step 3: Keep only the scoring-related columns
    scored_df = df[["CUSTOMER_SCORE", "TotalPurchaseAmount", "NumberOfPurchases", "TotalPurchaseTime"]]

    # Step 4: Materialize the Dask DataFrame into a Pandas DataFrame
    pd_df = scored_df.compute()

    end_time = time.perf_counter()  # Stop the timer
    execution_time = (end_time - start_time) * 1000  # Calculate the time in milliseconds
    print(f"Time of Execution: {execution_time:.4f} ms")

    return pd_df

def featurization_and_segmentation(scored_df, payment_threshold, score_threshold):
    print("__________________________________________________________")
    print("2. TASK 2: FEATURE ENGINEERING AND SEGMENTATION ...")

    # payment_threshold, score_threshold = float(payment_threshold), float(score_threshold)
    start_time = time.perf_counter()  # Start the timer

    df = scored_df

    # Feature: Indicator if customer's total purchase is above the payment threshold
    df["HighSpender"] = (df["TotalPurchaseAmount"] > payment_threshold).astype(int)

    # Feature: Average time between purchases
    df["AverageTimeBetweenPurchases"] = df["TotalPurchaseTime"] / df["NumberOfPurchases"]

    # Additional computationally intensive features
    df["Interaction1"] = df["TotalPurchaseAmount"] * df["NumberOfPurchases"]
    df["Interaction2"] = df["TotalPurchaseTime"] * df["CUSTOMER_SCORE"]
    df["PolynomialFeature"] = df["TotalPurchaseAmount"] ** 2

    # Segment customers based on the score_threshold
    df["ValueSegment"] = ["High Value" if score > score_threshold else "Low Value" for score in df["CUSTOMER_SCORE"]]

    end_time = time.perf_counter()  # Stop the timer
    execution_time = (end_time - start_time) * 1000  # Calculate the time in milliseconds
    print(f"Time of Execution: {execution_time:.4f} ms")

    return df

def segment_analysis(df: pd.DataFrame, metric):
    print("__________________________________________________________")
    print("3. TASK 3: SEGMENT ANALYSIS ...")
    start_time = time.perf_counter()  # Start the timer

    # Detailed analysis for each segment: mean/median of various metrics
    segment_analysis = (
        df.groupby("ValueSegment")
        .agg(
            {
                "CUSTOMER_SCORE": metric,
                "TotalPurchaseAmount": metric,
                "NumberOfPurchases": metric,
                "TotalPurchaseTime": metric,
                "HighSpender": "sum",  # Total number of high spenders in each segment
                "AverageTimeBetweenPurchases": metric,
            }
        )
        .reset_index()
    )

    end_time = time.perf_counter()  # Stop the timer
    execution_time = (end_time - start_time) * 1000  # Calculate the time in milliseconds
    print(f"Time of Execution: {execution_time:.4f} ms")

    return segment_analysis

def high_value_cust_summary_statistics(df: pd.DataFrame, segment_analysis: pd.DataFrame, summary_statistic_type: str):
    print("__________________________________________________________")
    print("4. TASK 4: ADDITIONAL ANALYSIS BASED ON SEGMENT ANALYSIS ...")
    start_time = time.perf_counter()  # Start the timer

    # Filter out the High Value customers
    high_value_customers = df[df["ValueSegment"] == "High Value"]

    # Use summary_statistic_type to calculate different types of summary statistics
    if summary_statistic_type == "mean":
        average_purchase_high_value = high_value_customers["TotalPurchaseAmount"].mean()
    elif summary_statistic_type == "median":
        average_purchase_high_value = high_value_customers["TotalPurchaseAmount"].median()
    elif summary_statistic_type == "max":
        average_purchase_high_value = high_value_customers["TotalPurchaseAmount"].max()
    elif summary_statistic_type == "min":
        average_purchase_high_value = high_value_customers["TotalPurchaseAmount"].min()

    median_score_high_value = high_value_customers["CUSTOMER_SCORE"].median()

    # Fetch the summary statistic for 'TotalPurchaseAmount' for High Value customers from segment_analysis
    segment_statistic_high_value = segment_analysis.loc[
        segment_analysis["ValueSegment"] == "High Value", "TotalPurchaseAmount"
    ].values[0]

    # Create a DataFrame to hold the results
    result_df = pd.DataFrame(
        {
            "SummaryStatisticType": [summary_statistic_type],
            "AveragePurchaseHighValue": [average_purchase_high_value],
            "MedianScoreHighValue": [median_score_high_value],
            "SegmentAnalysisHighValue": [segment_statistic_high_value],
        }
    )

    end_time = time.perf_counter()  # Stop the timer
    execution_time = (end_time - start_time) * 1000  # Calculate the time in milliseconds
    print(f"Time of Execution: {execution_time:.4f} ms")

    return result_df

Task 1 – Data Preprocessing and Customer Scoring

Python function: preprocess_and_score
This is the first step in your pipeline and perhaps the most crucial.
It reads a large dataset using Dask, designed for larger-than-memory computation.
It then calculates a “Customer Score” in a DataFrame named scored_df, based on various metrics like “TotalPurchaseAmount”, “NumberOfPurchases”, and “AverageReviewScore”.

After reading and processing the dataset with Dask, this task will output a Pandas DataFrame for further use in the remaining 3 tasks.

Task 2 – Feature Engineering and Segmentation

Python function: featurization_and_segmentation
This task takes the scored DataFrame and adds new features, such as an indicator for high spending.
It also segments the customers based on their scores.

Task 3 – Segment Analysis

Python function: segment_analysis
This task takes the segmented DataFrame and performs a group-wise analysis based on the customer segments to calculate various metrics.

Task 4 – Summary Statistics for High-Value Customers

Python function: high_value_cust_summary_statistics
This task performs an in-depth analysis of the high-value customer segment and returns summary statistics.

5. Modelling the Workflow in Taipy (config.py)

DAG in studio
Taipy DAG — Taipy “Tasks” in orange and “Data Nodes” in blue.

In this section, we will create the Taipy configuration which models the variables/parameters (represented as “Data Nodes”) and functions (represented as “Tasks”) in Taipy.

Notice that this configuration in the following config.py script is akin to defining variables and functions — except that we are instead defining “blueprint variables” (Data Nodes) and “blueprint functions” (Tasks).
We are telling Taipy how to call the functions we defined earlier, the default values of Data Nodes (which we may overwrite at runtime), and whether Tasks may be skipped:

### config.py
from taipy import Config

from algos.algo import (
    preprocess_and_score,
    featurization_and_segmentation,
    segment_analysis,
    high_value_cust_summary_statistics,
)

# -------------------- Data Nodes --------------------

path_to_data_cfg = Config.configure_data_node(id="path_to_data", default_data="data/customers_data.csv")

scored_df_cfg = Config.configure_data_node(id="scored_df")

payment_threshold_cfg = Config.configure_data_node(id="payment_threshold", default_data=1000)

score_threshold_cfg = Config.configure_data_node(id="score_threshold", default_data=1.5)

segmented_customer_df_cfg = Config.configure_data_node(id="segmented_customer_df")

metric_cfg = Config.configure_data_node(id="metric", default_data="mean")

segment_result_cfg = Config.configure_data_node(id="segment_result")

summary_statistic_type_cfg = Config.configure_data_node(id="summary_statistic_type", default_data="median")

high_value_summary_df_cfg = Config.configure_data_node(id="high_value_summary_df")

# -------------------- Tasks --------------------

preprocess_and_score_task_cfg = Config.configure_task(
    id="preprocess_and_score",
    function=preprocess_and_score,
    skippable=True,
    input=[path_to_data_cfg],
    output=[scored_df_cfg],
)

featurization_and_segmentation_task_cfg = Config.configure_task(
    id="featurization_and_segmentation",
    function=featurization_and_segmentation,
    skippable=True,
    input=[scored_df_cfg, payment_threshold_cfg, score_threshold_cfg],
    output=[segmented_customer_df_cfg],
)

segment_analysis_task_cfg = Config.configure_task(
    id="segment_analysis",
    function=segment_analysis,
    skippable=True,
    input=[segmented_customer_df_cfg, metric_cfg],
    output=[segment_result_cfg],
)

high_value_cust_summary_statistics_task_cfg = Config.configure_task(
    id="high_value_cust_summary_statistics",
    function=high_value_cust_summary_statistics,
    skippable=True,
    input=[segment_result_cfg, segmented_customer_df_cfg, summary_statistic_type_cfg],
    output=[high_value_summary_df_cfg],
)

scenario_cfg = Config.configure_scenario(
    id="scenario_1",
    task_configs=[
        preprocess_and_score_task_cfg,
        featurization_and_segmentation_task_cfg,
        segment_analysis_task_cfg,
        high_value_cust_summary_statistics_task_cfg,
    ],
)


You can read more about configuring Scenarios, Tasks and Data Nodes in the documentation here.

Taipy Studio

Taipy Studio is a VS Code extension from Taipy that allows you to build and visualize your pipelines with simple drag-and-drop interactions.
Taipy Studio provides a graphical editor where you can create your Taipy configurations stored in TOML files that your Taipy application can load to run.
The editor represents Scenarios as graphs, where nodes are Data Nodes and Tasks.

As an alternative for the config.py script in this section, you may instead use Taipy Studio to generate a config.toml configuration file.
The penultimate section in this article will provide a guide on how to create the config.toml configuration file using Taipy Studio.

6. Scenario Creation and Execution

Executing a Taipy scenario involves:

  • Loading the config;
  • Running the Taipy Core service; and
  • Creating and submitting the scenario for execution.

Here’s the basic code template:

import taipy as tp
from config import scenario_cfg  # Import the Scenario configuration
tp.Core().run()  # Start the Core service
scenario_1 = tp.create_scenario(scenario_cfg)  # Create a Scenario instance
scenario_1.submit()  # Submit the Scenario for execution

# Total runtime: 74.49s

Skip unnecessary task executions

One of Taipy’s most practical features is its ability to skip a task execution if its output is already computed.
Let’s explore this with some scenarios:

Changing Payment Threshold

# Changing Payment Threshold to 1600
scenario_1.payment_threshold.write(1600)
scenario_1.submit()

# Total runtime: 31.499s

What Happens: Taipy is intelligent enough to skip Task 1 because the payment threshold only affects Task 2.
In this case, we are seeing more than 50% reduction in execution time by running your pipeline with Taipy.

Changing Metric for Segment Analysis

# Changing metric to median
scenario_1.metric.write("median")
scenario_1.submit()

# Total runtime: 23.839s

What Happens: In this case, only Task 3 and Task 4 are affected. Taipy smartly skips Task 1 and Task 2.

Changing Summary Statistic Type

# Changing summary_statistic_type to max
scenario_1.summary_statistic_type.write("max")
scenario_1.submit()

# Total runtime: 5.084s


What Happens: Here, only Task 4 is affected, and Taipy executes only this task, skipping the rest.
Taipy’s smart task skipping is not just a time-saver; it’s a resource optimizer that becomes incredibly useful when dealing with large datasets.

7. Taipy Studio

You may use Taipy Studio to build the Taipy config.toml configuration file in place of defining the config.py script.

DAG inside Studio

First, install the Taipy Studio extension using the Extension Marketplace.

Creating the Configuration

  • Create a Config File: In VS Code, navigate to Taipy Studio, and initiate a new TOML configuration file by clicking the + button on the parameters window.


  • Then right-click on it and select Taipy: Show View.

Configuration show view

  • Adding entities to your Taipy Configurations:
    On the right-hand side of Taipy Studio, you should see a list of 3 icons that can be used to set up your pipeline.

Configuration icon

  1. The first item is for adding a Data Node. You can link any Python object to Taipy’s Data Nodes.
  2. The second item is for adding a Task. A Task can be linked to a predefined Python function.
  3. The third item is for adding a Scenario. Taipy allows you to have more than one Scenario in a configuration.

– Data Nodes

Input Data Node: Create a Data Node named “path_to_data”, then navigate to the Details tab, add a new property “default_data”, and paste “SMALL_amazon_customers_data.csv” as the path to your dataset.

Intermediate Data Nodes: We’ll need to add four more Data Nodes: “scored_df”, “segmented_customer_df”, “segment_result”, “high_value_summary_df”. With Taipy’s intelligent design, you don’t need to configure anything for these intermediate data nodes; the system handles them smartly.

Intermediate Data Nodes with Defaults: We finally define four more intermediate Data Nodes with the “default_data” property set to the following:

  • payment_threshold: “1000:int”

datanode view

  • score_threshold: “1.5:float”
  • metric: “mean”
  • summary_statistic_type: “median”

– Tasks

Clicking on the Add Task button, you can configure a new Task.
Add four Tasks, then link each Task to the appropriate function under the Details tab.
Taipy Studio will scan through your project folder and provide a categorized list of functions to choose from, sorted by the Python file.

Task 1 (preprocess_and_score): In Taipy Studio, you’d click the Task icon to add a new Task.
You’d specify the input as “path_to_data” and the output as “scored_df”.
Then, under the Details tab, you’d link this Task to the algos.algo.preprocess_and_score function.

Task Process and Score

Task 2 (featurization_and_segmentation): Similar to Task 1, you’d specify the inputs (”scored_df”, ”payment_threshold”, ”score_threshold”) and the output (”segmented_customer_df”). Link this Task to the algos.algo.featurization_and_segmentation function.

Task Featurization

Task 3 (segment_analysis): Inputs would be “segmented_customer_df” and “metric”, and the output would be “segment_result”.
Link to the algos.algo.segment_analysis function.

Task segment analysis

Task 4 (high_value_cust_summary_statistics): Inputs include “segment_result”, “segmented_customer_df”, and “summary_statistic_type”. The output is “high_value_summary_df”. Link to the algos.algo.high_value_cust_summary_statistics function.

Task Statistics

Conclusion

Taipy offers an intelligent way to build and manage data pipelines.
The skippable feature, in particular, makes it a powerful tool for optimizing computational resources and time, which is particularly beneficial in scenarios involving large datasets.
While Dask provides the raw power for data manipulation, Taipy adds a layer of intelligence, making your pipeline not just robust but also smart.

Additional Resources
For the complete code and TOML configuration, you can visit this GitHub repository. To dive deeper into Taipy, here’s the official documentation.

Once you understand Taipy Scenario management, you become much more efficient at building data-driven applications for your end users. Just focus on your algorithms, and Taipy handles the rest.


Hope you enjoyed this article!

Data Engineering For Beginners: A Step-By-Step Guide


For anyone who might be interested in embarking on data engineering, this article will serve as a stepping stone to explore the field further. Data engineering offers exciting opportunities to work with awesome technologies, solve complex data challenges, and contribute to the success of data-driven organizations.

By acquiring the necessary skills, staying up-to-date and gaining hands-on experience, you can embark on a rewarding career in data engineering.

Introduction

What is data engineering? — Data Engineering refers to designing, building, and maintaining the infrastructure and systems necessary for the collection, storage, processing, and analysis of large volumes of data.

Data engineers work closely with data scientists, analysts, and other stakeholders to create robust data pipelines and enable efficient data-driven decision-making.

Roles of a Data Engineer.

  1. Design and develop data pipelines that extract, transform, and load (ETL) data from various sources into a centralized storage system.
  2. Managing Data infrastructure required to store and process large volumes of data. This includes selecting and configuring databases, data warehouses, and data lakes, as well as optimizing their performance and scalability.
  3. Data modeling and database design: Data engineers work closely with data scientists and analysts to design data models and schemas that facilitate efficient data storage and retrieval.
  4. Monitoring and maintenance: Implementing data quality checks and validation processes to ensure the accuracy, consistency, and integrity of the data.

Key skills and knowledge required to become a Data Engineer:

If you are interested in becoming a data engineer, you need both technical skills and domain knowledge. Some of these skills include:

  1. Proficiency in programming languages like Python and SQL. A data engineer should be able to write efficient code to manipulate and process data and automate data workflows.
  2. Data storage and processing technologies: Data engineers should have an understanding of data storage and processing technologies such as relational databases (e.g., MySQL, PostgreSQL), distributed systems (e.g., Apache Hadoop, Apache Spark), and cloud-based platforms (e.g., AWS, Azure, GCP).
  3. ETL and data integration: Familiarity with ETL (Extract, Transform, Load) processes and tools for data integration is a must. Data engineers should have knowledge of data integration frameworks like Apache Airflow or commercial tools like Informatica.
  4. Data modeling and database design: Should have knowledge of data modeling techniques and database design principles to design efficient database schemas and optimize query performance.
  5. Big data technologies: Knowledge of big data technologies like Hadoop, Spark, and NoSQL databases is highly valuable due to the increase in the amount and complexity of data. Data engineers should be able to work with distributed computing frameworks and handle large-scale data processing.

Importance of data engineering in today’s data-driven world

In today’s data-driven world, data engineering plays the crucial role of enabling organizations to harness the power of data for insights and innovation. Here are some key reasons why data engineering is important:

  1. Data engineering helps organizations integrate data from various sources, such as databases, APIs, and external systems, into a unified and structured format.
  2. Data engineering ensures that the infrastructure and processes are designed to handle the increasing data demands (scaling), enabling faster data processing and analysis.
  3. Data engineering focuses on ensuring the quality and reliability of data by implementing data validation techniques for identifying and rectifying data inconsistencies, errors, and missing values.
  4. Data engineering involves implementing robust data governance and security measures to protect sensitive data and comply with regulations through access controls, encryption, data masking, and auditing mechanisms to safeguard data privacy and maintain data integrity.
  5. By building efficient data pipelines and systems, organizations can derive valuable insights from data in a timely manner. This enables stakeholders to make informed choices, identify trends, and uncover hidden patterns that can drive business growth and innovation.

An Overview of the data engineering process

  1. Collecting data.
  2. Cleansing the data
  3. Transforming the data.
  4. Processing the data.
  5. Monitoring.


Step-by-Step Guide to Data Engineering

Step 1: Defining data requirements
This involves understanding the business goals and objectives that drive the need for data analysis and decision-making. Here are two key aspects of this step:

  1. Identifying business goals and objectives

Data engineers collaborate with stakeholders in order to understand the organization’s goals and objectives. This includes identifying the business questions that need to be addressed, key performance indicators (KPIs) that need to be tracked, and the desired outcomes of data analysis. All this will ensure that the data infrastructure and processes are designed to support the organization’s specific objectives.

  2. Determining data sources and types

Data engineers work with stakeholders to determine the relevant data sources and types required to achieve the defined business goals. This will involve identifying both the internal (databases, data warehouses, or existing data lakes within the organization) and external ( APIs, third-party data providers, or publicly available datasets) data sources that contain the necessary information.

Data engineers also consider the types of the data, whether it is structured data (relational databases), semi-structured data (JSON or XML), or unstructured data (text documents or images).

Step 2: Data collection and ingestion
After defining the data requirements, the next step in the data engineering process is to collect and ingest the data into a storage system. This step involves these key activities:

  1. Extracting data from various sources

Data engineers utilize the appropriate techniques and tools to extract data from the identified data sources, which include databases, APIs, files or external data providers. This will involve querying databases, making API calls, or accessing files stored in different formats.

  2. Transforming and cleaning the data

After extracting the data, data engineers transform and clean the data to ensure its quality and compatibility with the target storage system. This involves techniques like data normalization, standardization, removing duplicates and handling missing or erroneous values. Data validation checks may also be done to ensure the integrity and consistency of the collected data.

  3. Loading the data into a storage system

Once data has been transformed and cleaned, data engineers load it into a storage system for further processing and analysis. The storage system of choice will depend on the organization’s requirements and may include relational databases, data warehouses, data lakes, or cloud-based storage solutions. Data engineers then design the appropriate schema to efficiently store and organize the data in the chosen storage system. A minimal sketch of this extract-transform-load flow is shown below.
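
Here is a minimal, hypothetical sketch of the three activities (extract, transform, load) using Pandas and SQLAlchemy; the connection strings, table names, and column names are placeholders rather than part of any real system:

### sketch: a tiny extract-transform-load flow (all names are placeholders)
import pandas as pd
from sqlalchemy import create_engine

# Extract: pull raw rows from an operational database
source = create_engine("postgresql://user:password@source-db:5432/shop")
orders = pd.read_sql("SELECT * FROM orders", source)

# Transform: clean and standardize the extracted data
orders = orders.drop_duplicates(subset="order_id")
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["order_id", "order_date"])
orders["amount"] = orders["amount"].fillna(0).round(2)

# Load: append the cleaned rows into an analytics warehouse table
warehouse = create_engine("postgresql://user:password@warehouse-db:5432/analytics")
orders.to_sql("fact_orders", warehouse, if_exists="append", index=False)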

Step 3: Data storage and management
The next step in the data engineering process is to effectively store and manage the data. This will involve the following:

  1. Choosing the appropriate storage system

A data engineer needs to evaluate the different storage systems and select the most appropriate for their particular organization. Factors like data volume, variety, velocity, scalability, performance and cost need to be considered before setting up the necessary infrastructure, defining data schemas and optimizing storage configurations. It is critical for a data engineer to ensure that the storage system chosen at this point is compatible with the data processing and analysis tools that will be used in the later steps.

  2. Implementing data governance and security measures

Data governance and security are critical aspects of data storage and management, and data engineers need to ensure data quality, consistency, and compliance with existing regulations. They also need to implement security measures to protect the data from unauthorized access, data breaches, and other security threats, using access controls, encryption mechanisms, data masking techniques, and auditing mechanisms to ensure data privacy and maintain data integrity.

Step 4: Data processing and transformation
Data processing frameworks provide the necessary tools and infrastructure to perform complex data processing tasks efficiently for example Apache Spark, which is designed for distributed data processing.
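
As a rough illustration of what such framework-based processing looks like, here is a short, hypothetical PySpark example; the storage paths and column names are invented for the sketch:

### sketch: distributed transformation and aggregation with Spark (hypothetical data)
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("processing-example").getOrCreate()

# Read a large, partitioned dataset from the data lake
events = spark.read.parquet("s3://datalake/clean/events/")

daily_revenue = (
    events
    .filter(F.col("event_type") == "purchase")          # transformation: keep only purchases
    .withColumn("event_date", F.to_date("event_time"))  # transformation: derive a date column
    .groupBy("event_date", "country")                   # aggregation: summarize per group
    .agg(F.count("*").alias("purchases"),
         F.sum("amount").alias("revenue"))
)

# Write the summarized result back for analysts to consume
daily_revenue.write.mode("overwrite").parquet("s3://datalake/marts/daily_revenue/")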

Once the data is stored and managed, the next step is to process and transform the data to derive meaningful insights and it will involve the following:

  1. Performing data transformation and aggregation

Data engineers need to convert the raw data into a format suitable for analysis; this involves cleaning the data, filtering it, merging data from different sources, and reshaping it to meet specified requirements. Data engineers also perform data aggregations to summarize and condense the data, enabling easier analysis and visualization. Transforming and aggregating the data will uncover patterns and trends within the data.

  2. Handling large-scale data processing

The amount of data keeps increasing and data engineers should know how to efficiently handle large-scale data processing. This involves optimizing data processing workflows, utilizing parallel processing techniques and using distributed computing frameworks. Effective handling of large-scale data processing ensures the insights derived from the data are obtained in a timely and efficient manner.

Step 5: Data quality and validation
Data quality involves ensuring the accuracy, consistency, and reliability of the data. The data quality and validation step involves the following:

  1. Ensuring data accuracy and consistency

Data engineers need to implement measures like performing data cleansing and data profiling techniques to identify and rectify any errors, inconsistencies or anomalies in the data. Data engineers also need to handle missing values, remove duplicates, and resolve data conflicts to improve accuracy and consistency of the data.

  2. Implementing data validation techniques

Data engineers implement various validation techniques to ensure that the data meets predefined standards and business rules. This involves performing data type checks, range checks, format checks, and referential integrity checks, which help identify and rectify data inconsistencies, errors, and missing values; a small example of such checks appears at the end of this step.

  3. Monitoring data quality over time

Data engineers need to establish mechanisms to monitor data quality over time to ensure that the data remains accurate, consistent, and reliable throughout its lifecycle. This involves setting up data quality metrics and implementing data quality monitoring tools and processes. Data engineers may set up automated data quality checks, create dashboards and have alerting mechanisms in place which will promptly identify and address any data quality issues.
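
As a small, hypothetical example of the kinds of checks described in this step (type, range, uniqueness, and completeness), using Pandas with made-up file and column names:

### sketch: simple data quality checks (hypothetical columns)
import pandas as pd

df = pd.read_csv("data/customers.csv")

report = {
    # Type check: purchase amounts must be numeric
    "non_numeric_amount": pd.to_numeric(df["total_purchase_amount"], errors="coerce").isna().sum(),
    # Range check: review scores must lie between 1 and 5
    "score_out_of_range": (~df["average_review_score"].between(1, 5)).sum(),
    # Uniqueness check: customer ids must not repeat
    "duplicate_customer_id": df["customer_id"].duplicated().sum(),
    # Completeness check: required fields must not be null
    "missing_required": df[["customer_id", "total_purchase_amount"]].isna().sum().sum(),
}
print(report)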

Step 6: Data integration and visualization
This involves combining data from various sources, creating pipelines and workflows, and visualizing the data in the form of dashboards and reports. The following are the steps involved:

  1. Integrating data from multiple sources

Data engineers work with various data sources and they design and implement data integration processes to extract data from these sources and transform it into a unified format. This may involve data mapping, data merging, and data cleansing techniques to ensure the data is consistent and ready for analysis.

  2. Creating data pipelines and workflows

Data engineers build data pipelines and workflows that automate the movement and processing of data. They design the flow of data from source to destination and incorporate data transformations, aggregations, and other processing steps. Data pipelines ensure that data is processed in an efficient and consistent way, enabling timely and accurate analysis. Workflow automation tools and frameworks like Apache Airflow are used to schedule and manage the data pipelines. A minimal Airflow sketch is shown after this list.

  3. Visualizing data for analysis and reporting

Data visualization is a tool for understanding and communicating insights from data. Data engineers collaborate with data analysts and data scientists to create visualizations to present the data and highlight key findings. Visualization tools include Tableau, Power BI or Python libraries like Matplotlib or Plotly to create interactive charts, graphs, and dashboards. The visualizations enable stakeholders to explore the data, identify patterns, and use the insights to make data-informed decisions.
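
As promised above, here is a minimal, hypothetical sketch of how such a pipeline could be scheduled with Apache Airflow (2.4 or later); the DAG id, schedule, and task functions are placeholders:

### sketch: a tiny Airflow DAG chaining extract -> transform -> load (placeholders)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the sources")       # placeholder

def transform():
    print("cleaning and aggregating the data")   # placeholder

def load():
    print("loading results into the warehouse")  # placeholder

with DAG(
    dag_id="customer_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run extraction first, then transformation, then loading
    extract_task >> transform_task >> load_task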

Wrapping Up!!

Data engineering is a critical field that empowers organizations to harness the full potential of their data. As a data engineer, you need to familiarize yourself with basics such as programming and data manipulation (ETL), know how to use visualization tools such as Tableau or Power BI, build pipelines, and understand how to structure data in a logical manner.

Hope you found this introduction to data engineering informative! If designing, building, and maintaining data systems at scale excites you, definitely give data engineering a go

Exploratory Data Analysis Using Data Visualization Techniques 📊

Data Visualization

Are you intrigued by the fascinating world of Data Science and eager to embark on a journey to unravel the hidden insights within data? If so, you’ve landed on the right path. Exploratory Data Analysis (EDA) is a critical phase in the data analysis process that involves the initial investigation of a dataset to summarize its main characteristics, often with the help of data visualization techniques.

EDA is like peeling the layers of an onion to reveal the hidden insights within the data. This article will guide you through the exciting realm of Exploratory Data Analysis (EDA) using Data Visualization Techniques, an essential step in the data science process.

Exploratory Data Analysis is a crucial step in the data analysis process. It allows data analysts and scientists to get a feel for the data, understand its characteristics, and generate hypotheses. Data visualization techniques are the tools that make EDA effective, providing insights that might otherwise remain hidden. In a data-driven world, mastering EDA is essential for making informed decisions and extracting valuable insights from your data

Data scientists serve as the bridge between raw, unprocessed data and valuable business insights. They have the unique skill set required to manipulate vast and seemingly meaningless datasets, extracting meaningful patterns and trends. This analysis, in turn, plays a crucial role in driving modern economies and assisting governments and organizations in addressing contemporary issues.

Data visualization techniques lie at the heart of this endeavor, helping data scientists and analysts make sense of the data and extract meaningful insights.

Understanding Exploratory Data Analysis

Exploratory Data Analysis, introduced by statistician John Tukey in the 1970s, is all about making sense of data without jumping to conclusions. It involves systematically examining data sets, summarizing their main characteristics, and creating visualizations to help understand the data’s structure, patterns, and anomalies.

Data visualization is at the heart of EDA. It’s the process of representing data graphically to uncover patterns, trends, and anomalies. The essential visualization techniques used in EDA are covered below, after a quick look at the EDA process itself.

The EDA Process

Data Collection: The EDA process begins with data collection. It’s essential to gather high-quality, clean data for meaningful analysis.

Data Cleaning: This step involves handling missing values, outliers, and inconsistencies in the data.

Univariate Analysis: In this stage, each variable is analyzed individually. This includes creating histograms, box plots, and summary statistics to understand their distribution.

Bivariate Analysis: Bivariate analysis explores relationships between pairs of variables. Scatter plots and correlation matrices are commonly used in this phase.

Multivariate Analysis: Multivariate analysis extends the exploration to multiple variables simultaneously. Techniques like heatmaps can be helpful.

Anomaly Detection: EDA often involves identifying and addressing outliers and anomalies in the data.
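To make these steps concrete, here is a minimal pandas sketch of the first few stages, assuming a hypothetical employees.csv file with numeric columns (the file and column names are placeholders):

import pandas as pd

# Data collection: load a (hypothetical) dataset
df = pd.read_csv("employees.csv")

# Data cleaning: inspect missing values and drop duplicate rows
print(df.isnull().sum())
df = df.drop_duplicates()

# Univariate analysis: summary statistics for each numeric column
print(df.describe())

# Bivariate analysis: correlations between pairs of numeric variables
print(df.corr(numeric_only=True))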

Data Visualization Techniques 📉

Data visualization involves creating graphical representations of data, making it easier for humans to understand and interpret. Here are some essential data visualization techniques and their roles in EDA:

  1. Scatter Plots
    Scatter plots are effective for visualizing the relationship between two continuous variables. They help identify patterns such as clusters, outliers, and trends. For instance, scatter plots can reveal whether there’s a correlation between a person’s age and income.


  2. Histograms and Density Plots
    Histograms provide a visual representation of the distribution of a single variable. They can indicate whether the data follows a normal distribution or if it’s skewed. Density plots offer a smoothed version of histograms, making it easier to see underlying patterns.


  3. Box Plots
    Box plots display the distribution of a dataset, showing the median, quartiles, and potential outliers. They are excellent for comparing distributions between different groups or categories. For instance, box plots can help you compare the salaries of employees in different departments of a company.


  4. Heatmaps
    Heatmaps are valuable for exploring relationships between multiple variables. They visualize the correlation between variables in a matrix form, making it evident which variables are strongly related and which are not.


  5. Time Series Plots
    Time series plots are ideal for visualizing data collected over time, such as stock prices, temperature, or website traffic. They help in identifying trends, seasonality, and anomalies.


  6. Bar Charts
    Bar charts are useful for displaying categorical data. They’re often used for comparing the frequencies or proportions of different categories. For instance, a bar chart can illustrate the market share of different smartphone brands.

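To see these techniques in action, here is a small sketch using seaborn and matplotlib on the sample "tips" dataset that ships with seaborn; any DataFrame of your own would work the same way:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")  # small sample dataset bundled with seaborn

# Scatter plot: relationship between two continuous variables
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.show()

# Histogram with a density curve: distribution of a single variable
sns.histplot(data=tips, x="total_bill", kde=True)
plt.show()

# Box plot: compare a distribution across categories
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()

# Heatmap: correlation matrix of the numeric columns
sns.heatmap(tips.corr(numeric_only=True), annot=True)
plt.show()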

Data Visualization: Bringing Data to Life📈

Data visualization is a pivotal aspect of data science. After performing various data operations, the ability to convey insights through visualizations is essential for effective communication. Here are some valuable resources to help you master this skill:

  • Tableau: Tableau is a powerful data visualization tool that allows you to create visually appealing and easy-to-understand charts, graphs, and dashboards. Its user-friendly interface makes it an ideal choice for data analysts and scientists.

  • D3.js: D3.js is a JavaScript library that empowers you to create interactive and dynamic data visualizations. It’s a popular choice for crafting custom and captivating visualizations.

  • Seaborn and Plotly: These Python libraries are handy for creating engaging and informative visualizations. Seaborn is known for its beautiful statistical plots, while Plotly enables you to build interactive charts.

  • Power BI: Power BI, short for Power Business Intelligence, is a robust business analytics service and data visualization tool developed by Microsoft. It empowers organizations and individuals to analyze data, share insights, and make data-driven decisions.

    Power BI is a suite of software services, applications, and connectors that work together to transform raw data into visually appealing and interactive reports and dashboards.

Building Your Data Science Portfolio

As you progress in your data science journey, start building your portfolio. Creating projects and writing articles about your data analysis experiences will set you apart. Consider using platforms like GitHub to showcase your work. Kaggle is another valuable resource, providing access to extensive datasets and a community of fellow data scientists.

The post Exploratory Data Analysis Using Data Visualization Techniques 📊. appeared first on ProdSens.live.

A Beginner’s Guide to Building LLM-Powered Applications with LangChain! https://prodsens.live/2023/08/30/a-beginners-guide-to-building-llm-powered-applications-with-langchain/ Wed, 30 Aug 2023 05:25:50 +0000


If you’re a developer or simply someone passionate about technology, you’ve likely encountered AI tools such as ChatGPT. These utilities are powered by advanced large language models (LLMs). Interested in taking it up a notch by crafting your own LLM-based applications? If so, LangChain is the platform for you.

Let’s set everything else aside and understand LLMs first. Then we can go over LangChain with a simple tutorial. Sounds interesting? Let’s get going.

What are Large Language Models (LLMs)?

Large Language Models (LLMs) like GPT-3 and GPT-4 from OpenAI are machine learning algorithms designed to understand and generate human-like text based on the data they’ve been trained on. These models are built using neural networks with millions or even billions of parameters, making them capable of complex tasks such as translation, summarization, question-answering, and even creative writing.

Trained on diverse and extensive datasets, often encompassing parts of the internet, books, and other texts, LLMs analyze the patterns and relationships between words and phrases to generate coherent and contextually relevant output. While they can perform a wide range of linguistic tasks, they are not conscious and don’t possess understanding or emotions, despite their ability to mimic such qualities in the text they generate.

(Image: how LLMs work. Source credits: NVIDIA)

Large language models primarily belong to a category of deep learning structures known as transformer networks. A transformer model is a type of neural network that gains an understanding of context and significance by identifying the connections between elements in a sequence, such as the words in a given sentence.

What is LangChain?

Developed by Harrison Chase and debuted in October 2022, LangChain is an open-source framework for building sturdy applications powered by Large Language Models, such as chatbots like ChatGPT and various tailor-made applications.

LangChain seeks to equip data engineers with an all-encompassing toolkit for utilizing LLMs in diverse use cases, such as chatbots, automated question-answering, text summarization, and beyond.

LangChain is composed of 6 modules explained below:

(Image: LangChain modules explained. Image credits: ByteByteGo)

  • Large Language Models:
    LangChain serves as a standard interface that allows for interactions with a wide range of Large Language Models (LLMs).

  • Prompt Construction:
    LangChain offers a variety of classes and functions designed to simplify the process of creating and handling prompts.

  • Conversational Memory:
    LangChain incorporates memory modules that enable the management and alteration of past chat conversations, a key feature for chatbots that need to recall previous interactions.

  • Intelligent Agents:
    LangChain equips agents with a comprehensive toolkit. These agents can choose which tools to utilize based on user input.

  • Indexes:
    Indexes in LangChain are methods for organizing documents in a manner that facilitates effective interaction with LLMs.

  • Chains:
    While using a single LLM may be sufficient for simpler tasks, LangChain provides a standard interface and some commonly used implementations for chaining LLMs together for more complex applications, either among themselves or with other specialized modules.

How Does LangChain Work?


LangChain takes large amounts of data and breaks it down into smaller chunks that can easily be embedded into a vector store. Then, with the help of LLMs, we can retrieve only the information that is needed.

When a user submits a prompt, LangChain queries the vector store for relevant information. When an exact or near match is found, that information is fed to the LLM to complete or generate the answer the user is looking for.
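As a rough sketch of that flow, assuming the langchain 0.0.x imports used in this tutorial, a FAISS index (which needs the faiss-cpu package installed), and a placeholder variable long_document_text holding your own raw text:

from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# Break the raw text into smaller, overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(long_document_text)  # long_document_text is a placeholder for your data

# Embed the chunks and store them in a vector store
vector_store = FAISS.from_texts(chunks, OpenAIEmbeddings())

# At question time, retrieve only the chunks relevant to the user's prompt
relevant_docs = vector_store.similarity_search("user question here", k=3)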

Get Started with LangChain

Let’s use SingleStore’s Notebooks feature (it is FREE to use) as our development environment for this tutorial.


The SingleStore Notebook extends the capabilities of Jupyter Notebook to enable data professionals to easily work and play around.

What is SingleStore?

SingleStore is a distributed, in-memory, SQL database management system designed for high-performance, high-velocity applications. It offers real-time analytics and mixes the capabilities of a traditional operational database with that of an analytical database to allow for transactions and analytics to be performed in a single system.

Sign up for SingleStore to use the Notebooks.

SingleStore Notebooks feature

Once you sign up to SingleStore, you will also receive $600 worth of free computing resources, so why not use this opportunity?

Click on ‘Notebooks’ and start with a blank Notebook.

Name it something like ‘LangChain-Tutorial’ or as per your wish.

Let’s start working with the Notebook we just created.
Follow this step-by-step guide: add the code shown in each step to your Notebook and execute it. Let’s start!

Now, to use LangChain, let’s first install it with the pip command.

!pip install -q langchain

To work with LangChain, you need integrations with one or more model providers like OpenAI or Hugging Face. In this example, let’s leverage OpenAI’s APIs, so let’s install it.

!pip install -q openai


Next, we need to set up the environment variable to play around.
Let’s do that.

import os
os.environ["OPENAI_API_KEY"] = "Your-API-Key"


Hope you know how to get your API key; if not, go to this link to get your OpenAI API key.

[Note: Make sure you still have the quota to use your API Key]

Next, let’s instantiate an LLM from OpenAI and make a prediction with this model.

Let’s ask our model the top 5 most populated cities in the world.

from langchain.llms import OpenAI
llm = OpenAI(temperature=0.7)
text = "what are the 5 most populated cities in the world?"
print(llm(text))


As you can see, our model made a prediction and printed the 5 most populated cities in the world.

Prompt Templates

Let’s first define the prompt template.

from langchain.prompts import PromptTemplate

# Creating a prompt
prompt = PromptTemplate(
    input_variables=["input"],
    template="what are the 5 most {input} cities in the world?",
)


We created our prompt. To get a prediction, let’s now call the format method and pass it an input.

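The call itself looks roughly like this (the input value "populated" is just an example):

# Fill the template and send the formatted prompt to the LLM defined earlier
formatted = prompt.format(input="populated")
print(llm(formatted))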

Creating Chains

So far, we’ve seen how to initialize an LLM and how to get a prediction from it. Now, let’s take a step forward and chain these steps together using the LLMChain class.

from langchain.chains import LLMChain
# Instancing a LLM model
llm = OpenAI(temperature=0.7)
# Creating a prompt
prompt = PromptTemplate(
  input_variables=["attribute"],
  template= "What is the largest {attribute} in the world?",
)

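The chain construction itself appeared only in a screenshot; a minimal sketch of that missing step, using "ocean" as an example input for the {attribute} placeholder, looks like this:

# Chain the prompt and the LLM together, then run it with an input
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("ocean"))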

You can see the prediction of the model.

Developing an Application Using LangChain LLM

Again use SingleStore’s Notebooks as the development environment.

Let’s develop a very simple chat application.

Start with a blank Notebook and name it as per your wish.

  • First, install the dependencies.
pip install langchain openai


  • Next, import the installed dependencies.
from langchain import ConversationChain, OpenAI, PromptTemplate, LLMChain

from langchain.memory import ConversationBufferWindowMemory


  • Get your OpenAI API key and save it safely.

  • Add and customize the LLM template

# Customize the LLM template 
template = """Assistant is a large language model trained by OpenAI.

{history}
Human: {human_input}
Assistant:"""

prompt = PromptTemplate(input_variables=["history", "human_input"], template=template)


  • Load the ChatGPT chain with the API key you saved earlier. Add the human input as ‘What is SingleStore?’. You can change the input to whatever you want.
chatgpt_chain = LLMChain(
    llm=OpenAI(openai_api_key="YOUR-API-KEY", temperature=0),
    prompt=prompt,
    verbose=True,
    memory=ConversationBufferWindowMemory(k=2),
)
# Predict a sentence using the chatgpt chain
output = chatgpt_chain.predict(human_input="What is SingleStore?")
# Display the model's response
print(output)

The script initializes the LLM chain using the OpenAI API key and a preset prompt. It then takes user input and shows the resulting output.

The expected output is the assistant’s response to the question, describing what SingleStore is.

Play with this by changing the human input text/content.

The complete execution steps, in code format, are available on GitHub.

LangChain emerges as an indispensable framework for data engineers and developers striving to build cutting-edge applications powered by large language models. Unlike traditional tools and platforms, LangChain offers a more robust and versatile framework tailored for complex AI applications. LangChain is not just another tool in a developer’s arsenal; it’s a transformative framework that redefines what is possible in the realm of AI-powered applications.

In the realm of GenAI applications and LLMs, it is highly recommended to know about vector databases. I recently wrote a complete overview of vector databases; you might like to go through that article.

Note: There is much more to learning about LLMs and LangChain, and this is my first attempt at writing about this framework. This is not a complete guide to LangChain. Please go through more articles and tutorials to understand LangChain in depth.

Don’t forget to sign up for SingleStore to use the free Notebooks feature. Play around and have fun learning.

Disclaimer: ChatGPT assisted with only some sections of this article.

The post A Beginner’s Guide to Building LLM-Powered Applications with LangChain! appeared first on ProdSens.live.

Building ETL/ELT Pipelines For Data Engineers. https://prodsens.live/2023/08/22/building-etl-elt-pipelines-for-data-engineers/ Tue, 22 Aug 2023 03:25:43 +0000


Introduction:

When it comes to processing data for analytical purposes, ETL (Extraction, Transformation, Load) and ELT (Extract, Load, Transform) pipelines play a pivotal role. In this article, we will delve into the definitions of these two processes, explore their respective use cases, and provide recommendations on which to employ based on different scenarios.

Defining ETL and ELT:

ETL, which stands for Extraction, Transformation, and Load, involves the extraction of data from various sources, transforming it to meet specific requirements, and then loading it into a target destination. On the other hand, ELT, or Extract, Load, Transform, encompasses the extraction of raw data, loading it into a target system, and subsequently transforming it as needed. Both ETL and ELT serve the purpose of preparing data for advanced analytics.
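As a toy illustration of the ETL idea in Python, assuming a hypothetical raw_orders.csv source (with order_id, quantity, and unit_price columns) and a local SQLite file standing in for the target warehouse:

import pandas as pd
import sqlite3

# Extract: read raw data from a source file
raw = pd.read_csv("raw_orders.csv")

# Transform: clean and enrich the data before loading
transformed = (
    raw.dropna(subset=["order_id"])
       .assign(order_total=lambda df: df["quantity"] * df["unit_price"])
)

# Load: write the refined data into the target system
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("orders", conn, if_exists="replace", index=False)

In an ELT pipeline, by contrast, the raw data would be loaded first and the transformation would run inside the warehouse itself, for example with SQL or a tool like DBT.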

ETL Process Utilization:

ETL pipelines are often used in situations involving legacy systems, where data engineers respond to ad hoc business requests and intricate data transformations. This process ensures that data is refined before being loaded into the target system, enhancing its quality and relevance.

ELT Process Preference:

ELT pipelines have gained preference due to their swifter execution compared to ETL pipelines. Additionally, the setup costs associated with ELT are lower, as analytics teams do not need to be involved from the outset, unlike ETL where their early engagement is required. Furthermore, ELT benefits from the security measures built into the data warehouse itself, whereas ETL requires engineers to layer on additional security measures.


Exploring a Practical ETL Pipeline Project:

To gain hands-on experience in Data Engineering, cloud technologies, and data warehousing, consider working on the project “Build an ETL Pipeline with DBT, Snowflake, and Airflow.” This project provides a solid foundation and equips you with valuable skills that stand out in the field. The tools employed in this project are:

  • DBT (Data Build Tool) for ETL processing.
  • Airflow for orchestration, building upon knowledge from a previous article.
  • Snowflake as the data warehouse.

Conclusion

In conclusion, we have gained a comprehensive understanding of ETL and ELT pipelines, including their distinctive use cases. By considering the scenarios outlined here, you can make informed decisions about which approach suits your data processing needs. As a recommendation, engaging in the mentioned project will undoubtedly enhance your expertise in Data Engineering, cloud technologies, and data warehousing. Embark on this journey to stand out and continue your learning in this dynamic field.

Happy learning!

The post Building ETL/ELT Pipelines For Data Engineers. appeared first on ProdSens.live.
