Preparing for the medium-term impact of AI on data engineering and analytics
We keep hearing hype about how AI will change our world. Hype cycle aside, for a coder exploring current LLM capabilities, it’s easy to see specific ways in which those capabilities overlap with what we do in our work. And it’s not hard to imagine them maturing to make a real difference in our day-to-day.
Predictions are hard at this point. There are many uncertainties about the long term, and we will need to see how LLM capabilities and product functionality evolve over time. But in this post, I want to contribute some high-level observations on how AI might affect our work in the medium term (say, the next 18 months), along with some advice on how to prepare, aimed at data engineers who haven’t been too hands-on with AI tools so far. I’ll focus on two areas: coding and data usage.
How AI might change the coding experience
AI in coding is currently one of the biggest application areas for LLMs, as seen in products like GitHub Copilot and Cursor. Code suggestions often let you get pretty good code written for you in response to a comment (`# Connect to Snowflake using SnowPark`) or a function definition (`def filter_lines(lines, field, value):`). Beyond code suggestions, we’re seeing AI coding tools expand to higher-level tasks, like making a plan and executing changes across several files (“Agent Mode”).
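To make the suggestion experience concrete, here’s a minimal sketch of the kind of completion a tool like Copilot might offer for that function stub, assuming the records are dicts (the exact suggestion varies from run to run):

```python
def filter_lines(lines, field, value):
    """Keep only the records in `lines` whose `field` equals `value`."""
    return [line for line in lines if line.get(field) == value]
```

Plausible, idiomatic, and close to what you’d write yourself — and still not guaranteed correct for your data, which is exactly why review matters.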
As a coder with a family and limited time for side projects, I’ve really enjoyed working with Copilot for personal projects. It’s helped me actually release software in languages that I didn’t know before (…still don’t…). But it’s also given me lots of incorrect or bad code, and I’ve had to rely on careful prompting and on my past programming experience to check everything the AI is doing.
I think this trend will continue even as AI capabilities improve. The AI will perform best when we are able to give it specific instructions for what we want, often pointing it in the right direction with notes on where to make changes or what frameworks to use. And we will have to validate everything the AI produces. At first, this will mean reading every line of code the AI generates. But as we know from human code review, this is not perfect — even a careful reviewer can let subtle mistakes slip through.
Thus, I think automated validations will play an increasingly large role in making AI coding work for us. This can take many forms. It could mean human-written integration tests that we check new patch sets against. It could mean having the AI do “test-driven development”: asking it to write unit tests first, checking those over, and then having it produce code that passes the tests. Just as AI product developers are increasingly making sure they have “evals” in place for LLM outputs, we coders can invest in automated validations that give us more confidence in AI-written code, which in turn will allow us to move much more quickly and trust AI-generated code in more places.
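As a sketch of that test-first loop (reusing the hypothetical `filter_lines` function from earlier; the test cases are invented), the human-reviewed spec might look like this before the AI writes the implementation:

```python
# test_filter_lines.py: a human-reviewed spec that AI-written code must pass.
# Assumes a hypothetical module `filter_lines` that the AI will implement.
from filter_lines import filter_lines

def test_keeps_only_matching_records():
    lines = [
        {"region": "EMEA", "revenue": 100},
        {"region": "APAC", "revenue": 250},
    ]
    assert filter_lines(lines, "region", "APAC") == [{"region": "APAC", "revenue": 250}]

def test_records_missing_the_field_are_excluded():
    # A record without the field should be filtered out, not raise an error.
    assert filter_lines([{"revenue": 50}], "region", "EMEA") == []
```

Once we trust the tests, `pytest` becomes the gate: the AI can iterate on the implementation freely, and we only need to re-review the code, not re-derive its correctness from scratch.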
My advice for coders:
- Get used to working alongside AI. If you haven’t tried GitHub Copilot, do so. See in what cases it can “read your mind” and in what cases it gives a close-but-not-quite-right suggestion. But go beyond that, too: try chatting with it and asking it to make larger changes. See what kind of context and guidance you need to give it to produce good results.
- Invest in “validations” now. A good test suite has always been a major enabler for engineering productivity in tech organizations, and I think it’ll be increasingly important as AI gains capabilities to accelerate our work — as long as we have good validations in place.
How AI might change data usage
As a data scientist, I’m always worried about people reaching the wrong conclusions from their data analyses, so my mental alarm bells are blaring as AI pushes into the data analytics space. I think we are headed into a world where AI systems will happily create a set of analytics for you based on a human-language prompt, but those analytics might be wrong or misleading in ways that are very difficult to notice.
My first piece of advice is to set our analysts (human and AI!) up for success. Be careful and intentional about how we model our data: clearly define our data fields and metrics, describe in detail how different data sources join together, and always map between human terms (“business definitions”) and database terms (rows and columns). This work is broadly valuable: beyond making our data more usable, it’s up-front “thinking work” that sharpens our ability to figure out what we should be looking at to begin with. Do this hard work now.
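As a minimal sketch of what that mapping could look like (the metric names, tables, and definitions below are invented for illustration), even a lightweight, machine-readable data dictionary gives both human and AI analysts something solid to stand on:

```python
# A lightweight data dictionary mapping business terms to database terms.
# All names and definitions here are illustrative, not from a real schema.
DATA_DICTIONARY = {
    "active_customer": {
        "definition": "A customer with at least one order in the last 90 days.",
        "source": "analytics.orders",
        "logic": "COUNT(DISTINCT order_id) >= 1 over a trailing 90-day window, per customer_id",
    },
    "net_revenue": {
        "definition": "Gross order revenue minus refunds, in USD.",
        "source": "analytics.orders LEFT JOIN analytics.refunds USING (order_id)",
        "logic": "SUM(orders.amount_usd) - SUM(COALESCE(refunds.amount_usd, 0))",
    },
}
```

Whether this lives in Python, a semantic layer, or plain documentation matters less than that it exists and stays current.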
My second piece of advice is for data analysts: similarly to coding, let’s get used to working with AI and validating its output. Power BI can create a dashboard from a human-language question, but where does it get things right and where does it get things wrong? What does the “validation” process look like for a human expert checking its work? Starting to get a sense of these things will really help us advise our non-technical business users, who will want to start asking Copilot for analysis more often.
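One form that validation could take is a spot check: recompute a headline number from source data we trust and compare it to what the AI-built dashboard shows. A sketch, with all file names and figures invented:

```python
import pandas as pd

# Value displayed on the AI-generated dashboard (hypothetical).
ai_reported_net_revenue = 1_254_300.00

# Recompute the same metric from source data we control (illustrative paths).
orders = pd.read_csv("orders.csv")
refunds = pd.read_csv("refunds.csv")
trusted = orders["amount_usd"].sum() - refunds["amount_usd"].sum()

# Flag discrepancies beyond a small tolerance for human investigation.
if abs(trusted - ai_reported_net_revenue) > 0.01 * trusted:
    print(f"Mismatch: trusted={trusted:,.2f} vs dashboard={ai_reported_net_revenue:,.2f}")
else:
    print("Dashboard figure matches the trusted computation.")
```

The interesting part isn’t the arithmetic; it’s deciding which numbers deserve a trusted reference computation in the first place.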
Finally: This is another good reason for us to work on our human relationships. When a technical expert objects to an analysis or report created by someone non-technical, it can sound like nit-picking that doesn’t acknowledge or answer the needs that prompted the report in the first place. And to the business owner hearing those complaints, a flawed report can seem like a manageable risk, even when it could send us down the wrong path for months or years to come. We can start working our organizational muscles now to build mutual understanding, alignment, and trust. As technologists and analysts, we need to make sure the organization is getting correct and well-understood answers from its data, but we also need to make sure its needs for data-driven answers are fulfilled. Let’s establish trusting partnerships now, so we’re in the loop and can weigh in appropriately when AI-generated insights begin to disseminate across the organization.