Building a Data Engineer AI Agent

In this tutorial, we’ll walk you through the process of creating a Data Engineer Agent on the Wabee AI Platform. We’ll explore a common use case involving fuzzy data matching and demonstrate how the agent can help you solve complex data engineering tasks with ease.

Example Use Case: Fuzzy Data Matching

To guide our exploration of the Wabee Platform, we’ll outline a common problem encountered by data engineers and analysts working in a corporate setting. You’ll soon see that a troublesome and tedious task will turn into a brief and leisurely conversation with our capable AI agents.

Problem Description:
Let’s suppose you are working as a data scientist for a Phone Marketplace business and have been given two datasets, “new_web_profiles.csv” and “old_web_profiles.csv”.

Both datasets contain customer profiles (e.g. names, addresses, registration dates) built from data collected on the company’s website. However, the site was updated last quarter, and a few fields were added, renamed, or removed, making it hard to merge the two CSVs into one comprehensive dataset. The business hopes to use this data to compute usage statistics and understand which market segments are profitable.
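To make “added, renamed and removed” concrete, here is a minimal sketch of how you might compare the headers of the two files before attempting a merge. The column names below are hypothetical stand-ins (the real files may use different ones), and the word-overlap rule for spotting renames is just one simple heuristic, not how the Wabee agent works.

```python
import csv
import io

# Hypothetical headers for illustration; the real files may differ.
OLD_CSV = "Full Name,Email,Address,Registration Date\n"
NEW_CSV = "First Name,Last Name,Email Address,Address,Signup Source\n"


def read_header(text):
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(text)))


def tokens(name):
    """Lowercase a column name and split it into a set of words."""
    return set(name.lower().replace("_", " ").split())


def compare_headers(old_cols, new_cols):
    """Report shared columns, columns unique to each file, and likely renames."""
    shared = [c for c in old_cols if c in new_cols]
    only_old = [c for c in old_cols if c not in new_cols]
    only_new = [c for c in new_cols if c not in old_cols]
    # Heuristic: treat two names as a likely rename when the words of one
    # are contained in the words of the other ("Email" vs "Email Address").
    likely_renames = [
        (o, n)
        for o in only_old
        for n in only_new
        if tokens(o) <= tokens(n) or tokens(n) <= tokens(o)
    ]
    return {
        "shared": shared,
        "only_old": only_old,
        "only_new": only_new,
        "likely_renames": likely_renames,
    }


report = compare_headers(read_header(OLD_CSV), read_header(NEW_CSV))
for key, value in report.items():
    print(f"{key}: {value}")
```

To try this on the downloaded files, replace the two sample strings with the text of each CSV, for example by reading the files with open(...).read().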

Before we dive deeper into the details of the problem and our potential solutions, let’s get situated in the Wabee Platform.

Download the dummy data to follow along:
New Web Profiles
Old Web Profiles

Note: This data is fictitious and was synthetically generated by an LLM.

1. Agent Creation

Let’s start by creating an agent.
Once you’ve logged in, select ‘Agents’ from the left sidebar.
Afterwards, your screen should look like this.

Next, click the “Create Agent” button in the top right.

For this walkthrough we only need a Single Agent, but for more sophisticated use cases a Team of Agents could be more appropriate.

We now face two choices: we can create an agent from scratch, or we can use one of our premade agents. Our Data Analyst agent should be up to the task, so let’s go ahead and click to configure it.

Here we can edit every aspect of our agent: its goals, guardrails, beliefs, etc. The prefilled configuration will work for our needs, so go ahead and give your agent a name and click “Next”.

Our agents use tools to help them accomplish tasks. Without a tool like the “Web Browser”, for example, an agent wouldn’t be able to search the web for relevant information. Proper tool selection is crucial both for enabling and for strategically limiting the range of actions an agent may take. Our default Data Analyst requires the tools listed in red. Add those tools and press “Create Agent” to proceed.

If you’ve followed the steps above successfully, you should be met with this message notifying you that the agent is being deployed.

2. Adding Files

Once the chat interface loads, our agent is ready to go! At this point, we need to perform two crucial steps before the agent begins its work:

Provide the relevant data and files for the agent’s tasks
Instruct the agent by describing the task you’d like performed

We can give the agent access to files in two ways. First, we can select “Agents” from the sidebar again, find our agent, and click the 3-dot options button.

Once there, we can navigate to Storage in the horizontal menu and click the “Upload” button to add the CSV files we’re trying to merge.

Alternatively, if we are inside the chat interface, we can add files by using the standard attachment (paperclip) button.

3. Prompting and Instruction

Recall that our problem involves joining two datasets of customer profiles, each generated by a different system. The storage formats and standards are annoyingly disjoint.

For example, one file stores the customer’s email address as “Email” and the other as “Email Address”. Notice also that one file stores the customer’s first and last name in separate columns, whereas the other has only a single full name column. Even with difficulties like these, we could certainly investigate the files manually and write a relatively short Python script to perform the join. But doing so would likely be tedious and uninteresting, and accommodating all the edge cases a job like this involves would take some trial and error.
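For comparison, the manual route might look roughly like the sketch below. It is only an illustration under stated assumptions: “Email” and “Email Address” come from the description above, while every other column name and all sample rows are hypothetical. It maps both layouts onto one schema and joins on a normalized email address, letting the newer record win when the same email appears in both files.

```python
import csv
import io

# Hypothetical sample rows; only "Email" and "Email Address" come from the
# tutorial text, every other column name here is an assumption.
OLD_CSV = """Full Name,Email,Registration Date
Ana Silva,ANA.SILVA@example.com,2021-03-14
Bob Lee,bob.lee@example.com,2020-11-02
"""
NEW_CSV = """First Name,Last Name,Email Address,Registration Date
Ana,Silva,ana.silva@example.com ,2021-03-14
Carla,Mendes,carla.mendes@example.com,2024-01-20
"""


def normalize_email(value):
    """Trim whitespace and lowercase so equal addresses compare equal."""
    return value.strip().lower()


def standardize_old(row):
    """Map a row from the old layout onto the shared schema."""
    return {
        "email": normalize_email(row["Email"]),
        "full_name": row["Full Name"].strip(),
        "registration_date": row["Registration Date"].strip(),
    }


def standardize_new(row):
    """Map a row from the new layout onto the shared schema."""
    full_name = f"{row['First Name'].strip()} {row['Last Name'].strip()}"
    return {
        "email": normalize_email(row["Email Address"]),
        "full_name": full_name,
        "registration_date": row["Registration Date"].strip(),
    }


def merge_profiles(old_text, new_text):
    """Join both files on email; new rows overwrite old rows with the same email."""
    merged = {}
    for row in csv.DictReader(io.StringIO(old_text)):
        record = standardize_old(row)
        merged[record["email"]] = record
    for row in csv.DictReader(io.StringIO(new_text)):
        record = standardize_new(row)
        merged[record["email"]] = record
    return sorted(merged.values(), key=lambda r: r["email"])


profiles = merge_profiles(OLD_CSV, NEW_CSV)
for profile in profiles:
    print(profile)
```

Even this small sketch has to decide how to normalize emails and which record wins on a conflict, which is exactly the kind of edge-case handling that makes the manual approach slow.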

Instead, let’s see how quickly and painlessly this task can be performed by our Data Analyst Agent.

Head over to the chat window by clicking the “Chat” button if you are not there already. Here we can provide the agent with some context on the task and set it on its way.

Try prompting the agent as follows:

You have two CSVs in the inputs/ folder. These files contain our company’s customer data from two different systems. I need you to standardize these two datasets. Perform the match based on the email address only and save the result as a CSV in the same folder.

Once the agent is running, we should see something similar to this screen.

4. Review Agent Outputs

After churning away for approximately one minute, we should get a successful result from the agent.

The agent describes the steps it took to solve the task, and we can review its intermediate reasoning steps and conclusions by clicking the “Show Reasoning” text.

The agent has told us that it has stored its output in the file “standardized_profiles.csv”. To access it, we need to return to the agent storage we used to upload our files in step (2).

As promised, we see a merged dataset containing the condensed information originally held by our two initial CSVs. You’re encouraged to explore this dataset and ask yourself whether you would have joined the files in precisely the same way. If anything differs from your expectations, you can easily correct the agent by telling it exactly what you’d like changed.
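One quick way to explore the merged file is a small sanity check like the sketch below, which counts rows, blank emails, and emails that appear more than once. The sample text and the “email” column name are hypothetical stand-ins: the columns the agent actually writes to “standardized_profiles.csv” may be named differently, so adjust the email_column argument accordingly.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the contents of standardized_profiles.csv;
# the column names the agent actually chooses may differ.
MERGED_CSV = """email,full_name
ana.silva@example.com,Ana Silva
bob.lee@example.com,Bob Lee
ana.silva@example.com,Ana Silva
,Unknown Person
"""


def check_merged(text, email_column="email"):
    """Count rows, blank emails, and emails that appear more than once."""
    rows = list(csv.DictReader(io.StringIO(text)))
    emails = [row[email_column].strip().lower() for row in rows]
    counts = Counter(e for e in emails if e)
    return {
        "rows": len(rows),
        "blank_emails": sum(1 for e in emails if not e),
        "duplicate_emails": sorted(e for e, n in counts.items() if n > 1),
    }


summary = check_merged(MERGED_CSV)
print(summary)
```

If the check reports duplicates or blanks in the real output, that is a good candidate for a follow-up message to the agent.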

It is also quite useful to click the logs button in the chat interface and inspect the detailed steps undertaken by the agent, so you have more context as to why and how the agent made its decisions. If an agent ever fails, the logs also give you clear insight into why, so that a correction will be straightforward.

5. Next Steps

We now encourage you to try giving the agent more autonomy. Prompt the agent again, but this time don’t tell it how to perform the join. Hopefully you’ll find that the agent arrives at an even more robust and impressive solution than the one we’ve already explored. Very minimal intervention is required for routine tasks, and it is often better to let Wabee Agents explore and optimize on their own than to micro-manage them. Of course, on the Wabee Platform you are the boss, so feel free to direct the agents as you like.

Please let us know how you’re getting along. Our team is active on Discord to resolve problems, discuss techniques, and plan improvements!