The 119th episode of the IoT Use Case Podcast is about the innovative integration and use of data in companies, illustrated by specific use cases such as the “Product Carbon Footprint”, customer data reconciliation and AI-based model training.
The episode highlights how the companies Steadforce and Starburst Data jointly develop and implement innovative data solutions. Madeleine Mickeleit welcomes two experts: Stephan Schiffner, CTO of Steadforce, and Roland Mackert, Head Alliances & Ecosystem; Partner Manager at Starburst Data.
Episode 119 at a glance:
- [11:12] Challenges, potentials and status quo – This is what the use case looks like in practice
- [26:56] Solutions, offerings and services – A look at the technologies used
Podcast episode summary
A central topic of the podcast episode is the concept of the data mesh, developed by Zhamak Dehghani. Data mesh breaks down traditional, monolithic and centralized data structures and looks at data from a new perspective. The concept is based on four principles: domain ownership, data as a product, self-service access to data, and federated governance.
Both experts discuss various use cases that are made possible by this type of data integration. One example is the correlation of sales and production data in order to compare customer information and carry out analyses. Another example is the “product carbon footprint”, where data from various sources such as ERP systems and production data must be merged.
Finally, Stephan Schiffner and Roland Mackert emphasize the advantages of decentralized data approaches. These enable companies to access their data flexibly and efficiently and use it for business decisions and analyses. Starburst Data’s technology, which is based on the Trino SQL query engine, plays a decisive role in this.
Podcast interview
Hello Stephan, hello Roland, welcome to the IoT Use Case Podcast. I’m really happy that you’re here today. I’m really looking forward to all the insights you’re sharing today. Stephan, how are you and where can I reach you right now?
Stephan
Hello Madeleine. Thank you very much for the invitation. You can reach me today at the office in Munich.
Very nice. Roland, where are you? Are you also traveling in Munich or where can I reach you right now?
Roland
Hello Madeleine. I am happy to be here. Today I can be found in my home office, also near Munich.
Shoutout to Munich here. I’ll be back in Munich next time. We should meet in person for coffee sometime. Let’s get into the topic. Stephan, you at Steadforce are an IT systems consultancy, you do a lot of individual projects, but you have a very specific offering. You focus on the topic of data acquisition, i.e. taking data to the next level with different data types. You are very much involved with the topic of data use: how can I build applications on this data? The topic of data need: what do I actually do with the data? What is the use case development behind it? The topic of data management and operations, i.e. how this customer solution is then operated and on which infrastructure. I think you have over 100 experts with you. Did I say that correctly?
Stephan
Yes, that sounded very good. I would actually like to go into a little more detail. We are dedicated to helping our customers build data-based software solutions that generate added value. One of my tasks in this context is also to further develop our entire service portfolio. We are currently in the process of narrowing that down to two sub-areas. The first is setting up the entire data management area, from data strategy and use case development through to infrastructure, for example in the cloud or with a streaming solution: how can I acquire and provide the data? The second is building use cases on top of it that go into the analytics area or are AI-based. We do this with a focus on the healthcare, automotive, industrial, manufacturing and chemical sectors. These are the core areas.
Very nice. I don’t know if you can actually name any companies, but you are, so to speak, also active across sectors in very different industries and in the business sector. Depending on the customer segment, you are probably quite broadly based, aren’t you?
Stephan
Exactly. Of course, it’s always a bit difficult to name customers, as they are actually very different sizes, from small start-ups, where we are also a bit of a business enabler, to very large corporations that are listed on the DAX. Very, very broad.
You can see a few references on your homepage. I’ll link it in the show notes. It’s really exciting to see the customers you work with. Now you’ve brought Roland with you. How did you actually get together? Is there a personal story or how do you work together?
Roland
Yes, as Starburst we are a software provider and we provide a data access platform. Because we focus on technology, we need competent business partners to help us implement the solutions for our customers. We work with companies such as Steadforce, which use their expertise in the data environment and also in architecture to integrate and use our software for customers and thus generate added value for the customer from the technology.
Stephan
It is always important to us that we don’t have to reinvent the wheel for every customer. That’s why having strategic partnerships with tool providers, such as Starburst, is a good thing. We actually come in at the point where it’s really about consulting, because we know the customer’s system landscape, we know the use cases, and we have an overview of the data and what is to be done with it. We are involved in consulting and integration, and that complements each other perfectly.
Very nice. Now you’ve given me the perfect transition. What use cases do you address together with Roland and the Starburst Data team? What kind of use cases are there today?
Stephan
Basically, we are building data mesh solutions. In other words, this is initially independent of a specific business use case per se, but is ultimately about building a data infrastructure to enable analytics solutions.
This topic of data mesh is a concept, initially a big buzzword. I think some listeners have heard of it before, others perhaps not at all. Roland, what is Data Mesh all about?
Roland
Data Mesh is a new concept for handling data, developed by Zhamak Dehghani. The aim is to overcome the challenges posed by monolithic, centralized data structures and look at data from a new perspective. Ultimately, there are four basic principles. The first is domain ownership, i.e. the responsibility for the data no longer lies with IT, but with the people who have the subject-matter responsibility for the data. The second is to regard data as a product that must be meaningful, trustworthy, understandable and structured. The third is to make the data easily available, i.e. to support the self-service concept: whoever needs the data can use it. Last but not least, there are specific rules and standards that need to be followed, which are governed by overarching data governance. Take the different areas in one company, such as sales or development, and the different types of data generated there. The sales department would have ownership of its sales data, i.e. who the customer is and the customer master data. The development team has ownership of the data related to the products: how is a product structured, and what is it used for? If you bring this together, I can combine a customer’s product data in a data mesh and implement various use cases in which it makes sense to combine the data. For example, if I want to know which customer has which products and how they are maintained, it makes sense to combine the data.
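To make the four principles a little more tangible, here is a minimal Python sketch. It is not taken from the episode or from any product: the `DataProduct` class, the `CATALOG` dictionary and the three governance checks are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for one data product in a mesh."""
    name: str
    owner_domain: str   # domain ownership: which department is responsible
    description: str    # data as a product: must be understandable
    columns: dict = field(default_factory=dict)  # column name -> type


def governance_violations(product):
    """Shared rules every domain has to follow (illustrative checks only)."""
    problems = []
    if not product.owner_domain:
        problems.append("missing owner domain")
    if not product.description:
        problems.append("missing description")
    if not product.columns:
        problems.append("missing schema")
    return problems


CATALOG = {}


def publish(product):
    """A domain publishes its product; rule violations block publication."""
    problems = governance_violations(product)
    if problems:
        raise ValueError(f"{product.name}: {', '.join(problems)}")
    CATALOG[product.name] = product


def discover(keyword):
    """Self-service: anyone can search the catalog by description."""
    return [p.name for p in CATALOG.values()
            if keyword.lower() in p.description.lower()]


publish(DataProduct("customer_master", "sales",
                    "Customer master data: who the customer is",
                    {"customer_id": "TEXT", "name": "TEXT"}))
publish(DataProduct("product_structure", "development",
                    "How a product is structured and what it is used for",
                    {"product_id": "TEXT", "component": "TEXT"}))
```

In this sketch the sales and development domains each publish their own product, and a consumer could call `discover("customer")` to find what is available without asking IT.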
I said it briefly in the intro, but of course it’s also exciting to find out what you actually do and what skills you have to offer. A quick word about you. Starburst was founded in Boston in 2017. You have focused very strongly on open source SQL engines, we will find out what that is. You enable data from different sources, either hosted locally or in the cloud, to be queried and merged. It’s always about different systems, where ownership also plays a major role. What is your vision here? How does it all work together?
Roland
Our vision is to make it as easy as possible for our customers to access the data and derive the greatest possible benefit from it. This is made possible by our analytics platform, which works particularly well on data lakes and also allows other data sources to be integrated. It brings data together, for example from a data warehouse, a CRM system or other databases, in a fast and efficient manner without moving the data. This means that I don’t have to extract the data from the systems, transform it and transfer it somewhere else. Instead, we read the data where it is and then make it available to the customer for analytics purposes.
[11:12] Challenges, potentials and status quo – This is what the use case looks like in practice
Being able to access the data directly at its source and use it without prior transformation is a major advantage. Could you go into more detail about the challenges your customers face on a day-to-day basis without this solution? What typical problems do IT departments or other teams have to deal with if they don’t have this solution?
Roland
Yes, very much so. On average, our customers move and copy their data 4-5 times after collecting it before they are actually able to analyze this data. We can significantly shorten this process by not requiring the data to be read from the systems just to be stored somewhere else, but rather by reading the data where it is stored. This also brings the benefit of being up-to-date. I don’t have to wait until the next batch run has taken place, I use the current data that is available there. Access is provided much faster than if I first have to move the data into such a data pipeline. This saves customers weeks and months in the implementation of projects. So it’s not just a matter of a few hours until I have access to the data. This also gives me flexibility. If my data changes, my queries change, the analyses I have change. Then I can map this much faster than if I have to go through this whole data transformation process again. Of course there are systems, such as data warehouses, which are very useful and help the customer and in which the data is also entered in this classic way. These systems are often used to do more than what they are actually needed for, which means inflexibility and high costs for the customer. We also help to connect a data warehouse, for example. The customer does not have to switch off or restructure their data sources. We take what is available, integrate it and thus help the customer to optimize both costs and access. Steadforce helps us to plan and implement this.
Stephan
Exactly, and what we observe time and again are customer IT landscapes that have grown over time, with a multitude of different systems containing various data pools. One question is then: is the content I have in one database semantically the same information as in another database? The surrounding infrastructure is also a challenge. Even once I make the data available, I still need tools to control, for example, data access security, i.e. who is allowed to access and analyze the data at all. This whole complex of data governance is then quite important. The data mesh approach and the SQL engine, which is also included in the Starburst product, are of course very important and helpful tools.
I’d like to translate what you’ve just said into practice a little more. You just gave this great example of data ownership. If we look at the development and sales area, for example, we can take a closer look at the challenges you have just described, where companies are losing time and money these days. Duplicates of data are created, which then have to be transformed. In order to carry out analyses, you need a certain degree of scalability, including in the IT system behind it. These systems have grown historically and are correspondingly complex. Could you explain this using a specific example, for example why the data needs to be transformed at all?
Stephan
Let’s imagine I have a sales database, and Mr. Meyer is in it as a data record. At the same time there is a production database: screws are produced, there is an order, and the customer is Mr. Meyer. But now nobody can actually tell me whether Mr. Meyer in the production control database is the same Mr. Meyer as in the sales database. This is a classic problem when I have two separate data pools. The resulting problem is that I cannot easily correlate this information. For example, I would like to look at how many orders were actually produced for Mr. Meyer in the end, but I cannot do any analytics on it. With the help of a data mesh, I can now provide both pieces of information in curated form. This means that, as part of domain ownership, the specialist departments know their data and understand the semantics behind it. They can prepare the data accordingly to ensure a certain level of quality. That’s the first step. With the Starburst product, I am then able to correlate these two data sources so that I can easily run SQL queries on them and create reports.
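The Mr. Meyer example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: two in-memory SQLite databases stand in for the sales and production systems, and a shared `customer_id` key (which the two departments would have to agree on when curating their data) is what makes the two Mr. Meyers the same person. None of the table or column names come from the episode.

```python
import sqlite3

# Two separate in-memory databases stand in for the sales and production systems.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS production")

conn.execute("CREATE TABLE sales.customers (customer_id TEXT PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE production.orders "
    "(order_id INTEGER, customer_id TEXT, item TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sales.customers VALUES (?, ?)",
    [("C-001", "Meyer"), ("C-002", "Schmidt")],
)
conn.executemany(
    "INSERT INTO production.orders VALUES (?, ?, ?, ?)",
    [(1, "C-001", "screws", 500), (2, "C-001", "screws", 250), (3, "C-002", "bolts", 100)],
)


def orders_per_customer(connection):
    """One SQL query that spans both databases via the shared customer_id."""
    query = """
        SELECT c.name, COUNT(o.order_id) AS order_count, SUM(o.quantity) AS total_quantity
        FROM sales.customers AS c
        JOIN production.orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name
        ORDER BY c.name
    """
    return connection.execute(query).fetchall()


report = orders_per_customer(conn)
```

Once both sides expose the same key, the question "how many orders were produced for Mr. Meyer?" becomes a single join rather than a manual reconciliation.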
Thank you for the example. How do customers do this without your solution? Do I have to take these individual data sets, do the translation work, and then create them in the system? How do companies do it without you?
Stephan
A classic approach, or rather an approach that has become the standard in recent years, is to extract all data from various data sources and transfer it to a central location, such as a data lake or data warehouse, using data pipelines. However, this also means that someone has to develop these pipelines, which is usually the internal IT department. However, this IT department is often not as familiar with the data as the specialist departments, which can lead to bottlenecks. This means that the more requirements there are in the company to transfer data to this central location, the more personnel are required to do so. Roland can certainly say something about this from his experience. Another problem is that errors in these pipeline systems, which become increasingly complex over time, or when changes are required, result in considerable effort and costs. In comparison, the Data Mesh solution allows data to be connected without moving it, which can be more efficient.
Right. This is an important point that I would like to emphasize. Many companies have historically grown systems and are faced with the challenge of modernizing them in a scalable way. Some are trying to solve this internally, while there are also solutions on the market that already enable the scalability and integration of data. This is actually an area in which your company is active and offers solutions.
Roland
The customer can also continue to follow the same procedure as before. This means that if a customer wants to make changes and move towards a data mesh approach, they don’t have to start from scratch. It’s not as if it switches from one day to the next and everything is different the next day. It is a gradual process. We can access the existing data sources that are currently in use, while allowing the customer to gradually introduce the new data mesh approach. This includes a change in data responsibilities and the gradual introduction of new governance. For example, if a customer decides to store their data in the cloud and build a cross-border Data Mesh, we can establish access to the data where it currently resides. We also know where the customer plans to store their data in the future. We provide transparent access to the data for analysis without having to know exactly where it will be at the end of the day. In this way, we can support the customer in their gradual transition to a data mesh.
Yes, it’s interesting, and coming back to the business case, we often talk about business use cases that sound impressive at first glance. However, it is important to understand that the implementation of such use cases requires a scalable infrastructure. If we look at the “product carbon footprint” use case, for example, we have to merge data from different sources such as the ERP system and production. One challenge is to take into account different data types and designations in the various sources in order to make the data usable for such business applications. A scalable infrastructure and the ability to integrate and harmonize data from different sources are crucial to successfully implement such business requirements, right?
Stephan
Exactly, the availability of this data is crucial in any case. It may also be helpful or necessary to create a metadata schema that can be used in the various specialist departments. This determines how the data must ultimately fit together. However, linking data from different sources is particularly important for many use cases, as it enables new insights in the analysis and may lead to process changes that were not previously considered. For example, it can lead to correlations that were not previously identified.
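As a rough illustration of the shared metadata schema Stephan mentions, the following Python sketch maps source-specific field names from an ERP record and a production record onto common names, converts units, and then computes a footprint. Everything in it is an assumption made for the example: the field names, the alias table, and especially the emission factors, which are placeholders and not real values.

```python
# Hypothetical mapping from source-specific field names to one shared schema.
FIELD_ALIASES = {
    "erp": {"MaterialNo": "product_id", "SteelKg": "steel_kg"},
    "production": {"prod_id": "product_id", "energy_wh": "energy_wh"},
}

# Placeholder emission factors for illustration only, NOT real values.
STEEL_KG_CO2E_PER_KG = 2.0
ENERGY_KG_CO2E_PER_KWH = 0.5


def harmonize(source, record):
    """Rename known fields to the shared schema and drop everything else."""
    aliases = FIELD_ALIASES[source]
    return {aliases[key]: value for key, value in record.items() if key in aliases}


def product_carbon_footprint(erp_record, production_record):
    """Combine material data (ERP) and energy data (production) for one product."""
    erp = harmonize("erp", erp_record)
    prod = harmonize("production", production_record)
    if erp["product_id"] != prod["product_id"]:
        raise ValueError("records describe different products")
    energy_kwh = prod["energy_wh"] / 1000  # unit harmonization: Wh -> kWh
    return erp["steel_kg"] * STEEL_KG_CO2E_PER_KG + energy_kwh * ENERGY_KG_CO2E_PER_KWH


footprint = product_carbon_footprint(
    {"MaterialNo": "P-100", "SteelKg": 3.0, "Plant": "Munich"},
    {"prod_id": "P-100", "energy_wh": 4000},
)
```

The point of the sketch is the harmonization step: the calculation itself is trivial once both sources agree on names and units.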
Yes, absolutely. The topic of Data Mesh, when viewed as a technology or as what you collectively offer, may sound abstract, but it actually provides the necessary infrastructure and scalability to implement such projects. Perhaps you, Roland, can explain in more detail what types of data are important for your projects. This is probably less about real-time data from devices or machines and more about certain types of data. Could you explain this in more detail?
Roland
Yes, very much so. In the banking environment, for example, it is highly unlikely that the account-holding system in which the transactions are carried out will be accessed directly in order to read this data. However, the banks also export this data to other systems in order to make it available for further processing. On the one hand, the data we use is structured data, i.e. data from relational databases. However, it can also be semi-structured data, for example temperature data. This is also available in different formats, and we can merge it. We can also use streaming data that represents a specific event and brings in a dynamic element, so that it can be combined with the static data stored in databases, for example.
Confluent, for example, which builds on the Apache Kafka standard, is also one of our partners. Is this a data source for you, where you basically say: okay, there is streaming data available that a specific use case simply needs, and you access it? Are they a kind of partner for you? How do you work with technologies like Confluent or others from the streaming environment?
Roland
Absolutely. Confluent, or Kafka, is a system to which we have a connector. This means that we can connect precisely this data. For us, this is like a data source that we use to provide event information for analyses or to bring together certain things triggered by events. If the temperature sensor exceeds a certain value, it is not only important to trigger the alarm, but also to investigate which sensor was affected, in which machine it is located, which customer has the machine and which framework agreement or maintenance contract this customer has. In this way, a comprehensive overview can be created based on data from both the system and the sensor’s environment, as well as the specific events that have taken place.
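Roland’s sensor example, an event that only becomes useful once it is combined with master data, might look like this in a minimal Python sketch. The tables, identifiers and the temperature threshold are hypothetical, and an in-memory SQLite database stands in for the systems that hold the static data; the event itself is a plain dictionary rather than a real Kafka message.

```python
import sqlite3

# Static master data; in practice this would live in several source systems.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sensors (sensor_id TEXT, machine_id TEXT);
    CREATE TABLE machines (machine_id TEXT, customer_id TEXT);
    CREATE TABLE contracts (customer_id TEXT, contract_type TEXT);
    INSERT INTO sensors VALUES ('S-1', 'M-7');
    INSERT INTO machines VALUES ('M-7', 'C-001');
    INSERT INTO contracts VALUES ('C-001', 'premium maintenance');
""")

TEMPERATURE_LIMIT_C = 80.0  # hypothetical alarm threshold


def enrich_event(connection, event):
    """Return the event plus machine, customer and contract, or None if no alarm."""
    if event["temperature_c"] <= TEMPERATURE_LIMIT_C:
        return None
    row = connection.execute(
        """
        SELECT s.machine_id, m.customer_id, c.contract_type
        FROM sensors AS s
        JOIN machines AS m ON m.machine_id = s.machine_id
        JOIN contracts AS c ON c.customer_id = m.customer_id
        WHERE s.sensor_id = ?
        """,
        (event["sensor_id"],),
    ).fetchone()
    if row is None:
        raise LookupError(f"no master data for sensor {event['sensor_id']}")
    machine_id, customer_id, contract_type = row
    return {**event, "machine_id": machine_id,
            "customer_id": customer_id, "contract_type": contract_type}


alert = enrich_event(conn, {"sensor_id": "S-1", "temperature_c": 92.5})
```

Here `alert` carries not only the reading but also which machine, which customer and which maintenance contract are affected.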
Confluent probably also deals with masses of real-time data. This is probably high-frequency data that matters for use cases such as analyzing a damage claim or reacting in real time. This means that you can also connect data types that are relevant for these use cases.
Roland
Yes, absolutely. Kafka and Confluent are sources we use.
Stephan
All these data sources are connected, and what I can then do with the help of the solution is access them with SQL queries, regardless of whether they hold relational or semi-structured data. The moment I have them connected, I can run evaluations on them with a relatively standard tool.
These are like triggers or what exactly is a query at this point?
Stephan
A query is just a request. If, for example, I have an Excel file and a database as data sources, I can’t simply run an SQL query across both. But the moment we connect them to the system via connectors, which we will certainly get to later, I can make them analyzable with the help of SQL.
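The idea of a connector can be sketched as follows. This is a toy illustration, not how Starburst or Trino connectors are implemented: a CSV string stands in for the spreadsheet, and the hypothetical `connect_price_list` function copies its rows into a scratch table, whereas a real engine’s connectors read from the source at query time. All names and values are made up.

```python
import csv
import io
import sqlite3

# Stand-in for a spreadsheet export; all names and values here are made up.
PRICE_LIST_CSV = "item,unit_price\nscrews,0.5\nbolts,0.25\n"


def connect_price_list(connection, csv_text):
    """Toy connector: read the file and expose its rows as a SQL table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    connection.execute("CREATE TEMP TABLE price_list (item TEXT, unit_price REAL)")
    connection.executemany(
        "INSERT INTO price_list VALUES (:item, :unit_price)", rows
    )


# The "database" side of the example: an ordinary relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, item TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "screws", 500), (2, "bolts", 100)],
)
connect_price_list(conn, PRICE_LIST_CSV)

# Once both sources are connected, one SQL query can combine them.
order_values = conn.execute(
    """
    SELECT o.order_id, o.item, o.quantity * p.unit_price AS order_value
    FROM orders AS o
    JOIN price_list AS p ON p.item = o.item
    ORDER BY o.order_id
    """
).fetchall()
```

The file on its own cannot answer SQL, but after the connector step it participates in a join like any other table.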
What are the technological requirements for such solutions that come primarily from your customers and also in the collaboration with Stephan and Steadforce? What are the technological requirements for solutions like yours?
Roland
The technological requirements are certainly the ability to access the data at all. Of course, the infrastructure must also allow access. In addition to the technological requirements, there are also organizational requirements that we have to take into account, such as the right to access the data and to obtain approvals for it. These are issues that we definitely need to tackle and resolve as part of the projects.
[26:56] Solutions, offerings and services – A look at the technologies used
Stephan, what is important when it comes to implementation? You’ve already mentioned that these are often historically grown systems and that you can connect a wide variety of data to them, but how do you really go about it?
Stephan
If you take a closer look, our data mesh is initially a comprehensive concept and, as Roland mentioned at the beginning, implementation is a step-by-step process. This means that it is not possible to start a data mesh project overnight and complete it immediately. What has proven successful is to actually start with a use case, to build one or two data products like this, to set up an initial MVP to show that I can now generate added value with it. The organization can learn how to implement this accordingly. I’m not just talking about the technological side, but also about the responsibilities that may be involved. This is the first step from the specialist side. The second question is how can I implement this technically? It’s helpful if you only have a small use case to start with. There are then simply two possibilities. Either a partner like us, in this case Starburst, takes on the implementation, or internal IT is empowered to carry out the project, possibly in collaboration with external experts. We offer support in the form of training and architectural advice to ensure that the project is implemented successfully.
Such a use case would be, for example, what I said about the product carbon footprint or what you said about this topic: You want to make a cross-check with Mr. Meyer. Is this the same customer? What did he order and so on? Use cases like that. It probably doesn’t make sense to say I want to do condition monitoring of a sensor now, because it’s about overarching system data that you need there.
Stephan
Exactly, because the topic of correlating these data sources is already the focus. A very important part is at the beginning, for example, when we hold workshops with our customers to develop this use case or to evaluate it and determine whether it is something that we can implement within a reasonable period of time. What are the probabilities of success and what do I actually want to achieve with this? What can we now implement within a reasonable period of time?
We have already talked about specific data. For example, I now have development data or data from Mr. Meyer in the CRM or I have certain contract data where the ownership lies, for example, in sales or in production. How does this data acquisition for the use case work from these individual IT systems, i.e. from the data management systems, if you like? How does that work with your solution?
Roland
Yes, if we have the appropriate approvals, we can link into the individual systems via our connectors. We offer the options there to cover the rules and regulations, access rights and all the security issues that the customer ultimately needs. Not everyone should be able to see all the data and access it at will. We also offer the option of compiling certain data products on our platform in order to make them available for analysis.
Stephan, are these connectors that are always already there by default? There are so many different IT systems out there. Is everything already available or how does that work?
Stephan
Of course, there are not off-the-shelf solutions for everything. However, it is worth mentioning that Starburst offers an extensive selection of pre-built connectors that are very well suited for the first steps and can already cover many requirements. Nevertheless, individual requirements may arise that require customized connectors. In such cases, we can provide advice and help with the implementation of individual connectors.
You spoke earlier about this SQL query engine to retrieve this data. What is the technology behind it? What are you building on, what kind of connectors are these?
Stephan
The technology behind it is called Trino and it is actually a federated query engine that can be used to make such queries. You can compare that a bit with Kafka and Confluent, since we just mentioned it earlier. In other words, Kafka as a streaming solution, open source, which is then provided by Confluent when it comes to the enterprise sector. The situation is similar here with Trino and Starburst. This means that Trino is the open source engine that is ultimately used and Starburst is the manufacturer for the enterprise solution.
Does this sit hierarchically below or above Kafka? Kafka is always about data in motion, with specific data pipelines where this data is queried. Are the two on the same level, because they are simply different systems and it depends on the use case, or is one above or below the other?
Stephan
I would say they are different use cases. Kafka is all about getting data in motion for streaming. Our data mesh side is more about evaluation, or if we stick with the Trino example, about really building queries on the data. Kafka can then be a data source that is connected directly, but it doesn’t have to be. Or I can have Kafka as one data source alongside three others and then run a comprehensive query across them to perform a specific analysis.
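What such a comprehensive query could look like is sketched below. Trino addresses tables as `catalog.schema.table`, with each connected source appearing as its own catalog. The catalog, schema, table and column names used here (`kafka`, `crm`, `sensor_events`, `customers` and so on) are assumptions for illustration, and the Python code only builds the SQL text; actually running it would require a configured cluster.

```python
def qualified(catalog, schema, table):
    """Build a fully qualified table name in catalog.schema.table form."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query(events_table, customers_table):
    """Return SQL that joins a streaming source with a relational source."""
    return (
        "SELECT c.name, COUNT(*) AS alarm_count\n"
        f"FROM {events_table} AS e\n"
        f"JOIN {customers_table} AS c ON c.customer_id = e.customer_id\n"
        "WHERE e.temperature_c > 80\n"
        "GROUP BY c.name"
    )


# Hypothetical catalogs: one backed by Kafka topics, one by a CRM database.
query = build_federated_query(
    qualified("kafka", "default", "sensor_events"),
    qualified("crm", "public", "customers"),
)
```

From the analyst’s point of view the Kafka topic and the CRM table are just two tables in one statement; which system sits behind each catalog is a configuration detail.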
Roland, you said earlier that you leave the data where it is. In other words, you use this query mechanism to access the data exactly where it is located. This means that the data can be acquired via the connector and you access the data and it stays there.
Roland
That’s right. We read the data and make it available for analytics. We also combine it, since we can connect different sources, but the data is not copied anywhere. At least not normally; we only read it.
Exactly, so it becomes as scalable as possible and you can design the architecture accordingly without perhaps moving it to a data warehouse or data lake. Have I understood this correctly? It’s a different kind of architecture that you’re creating, isn’t it?
Roland
Yes, exactly, we use the data lakes, we use the sources that are there, but we don’t move the data. This gives the customer the flexibility and speed they want. Trino, the open source project we have just mentioned, is the query engine and also the core of Starburst: Starburst is based on Trino, and we extend it. Trino is a so-called MPP (massively parallel processing) query engine. This means that we bring our own computing power and are therefore highly performant and can search through and analyze very, very large amounts of data. We scale into the petabyte range. Trino emerged from Presto, and Presto was originally developed at Facebook for precisely this purpose: to be able to analyze very, very large amounts of data in data lakes using standard SQL.
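The basic pattern behind massively parallel processing, splitting the data, letting several workers compute partial results and merging them at the end, can be shown in a small Python sketch. It is a conceptual illustration only and says nothing about how Trino schedules work internally; the function names are hypothetical, and Python threads demonstrate the split-and-merge structure rather than real performance gains.

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, parts):
    """Distribute rows round-robin across the given number of partitions."""
    return [rows[i::parts] for i in range(parts)]


def partial_aggregate(partition):
    """Each worker returns a partial result: (sum, count) for its partition."""
    return (sum(partition), len(partition))


def parallel_average(rows, workers=4):
    """Compute an average by merging per-partition partial aggregates."""
    if not rows:
        raise ValueError("cannot average an empty data set")
    partitions = split(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


average = parallel_average(list(range(1, 101)))
```

Because each worker only returns a small partial result, the final merge stays cheap even when the partitions themselves are very large.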
The last question for today I would ask a little bit in the direction of data analysis. We have now addressed two use cases, for example this topic of product carbon footprint or the one with Mr. Meyer. How do you now perform this evaluation, the analytics for the individual use cases? How do I analyze the data, how does it work?
Roland
If I now look at it from a Starburst perspective, I have different roles here that work with the data. For example, I have the data and business analyst, I have the data scientist and so on. I have different people, different profiles, who work with the data. Because we offer standard SQL, we give those profiles or employees the opportunity to work with the analysis tool of their choice, provided it supports our standard SQL interfaces. This gives customers the freedom to use what they consider to be the ideal tools for their particular role.
Stephan
Exactly, we’re at the interface. This means that with the product and the data mesh approach, I ultimately build up the data foundation so that I have data of the appropriate quality for the use cases that follow. But how and what I put on it is of course up to me. These can be simple reports, evaluations and dashboards and whatever else you might need, through to more complex AI-based use cases where you can use the data for model training and then build your own applications based on it.
If you have any questions or points for discussion, please feel free to take them up with Roland and Stephan. I will link their LinkedIn contacts in the show notes. But for now, thank you for this exciting presentation of how the whole thing works in practice. What makes you special, also in the interaction, is that you create a certain data ownership situation. This means that departments such as development, production or sales have the option of preparing their own data. You could almost call it a marketplace situation. Everyone has sovereignty over their data and prepares it as they need it. At the same time, you have the option of accessing these systems, leaving the data where it is and then carrying out the analytics for these use cases, together with Steadforce, in order to enable the various business use cases. Thank you for also explaining what the business case is, i.e. what I save in terms of time and money, because in the end it’s always a question of whether I do it myself or buy it. I think that came across very clearly. So thank you very much from my side. You are welcome to elaborate a little or add to it, but I would like to hand over the last word to you for today.
Roland
Thank you, Madeleine. In my view, the summary sums it up very, very well. Thank you very much for this. Data continues to play a major and growing role for customers. I believe that the data mesh concept, data ownership in the domains, also offers many users and customers the opportunity to make data more profitably available to the company. You even briefly mentioned offering a marketplace for those who benefit from it. You will then very quickly find out which data is helpful and which is perhaps less so. This means that the company can of course continue to optimize itself successively based on the data.
Stephan
Thank you from my side too, Madeleine. I think the whole concept of decentralization is something that we also see in software development, especially in the context of microservices. There is a certain parallel between these concepts, and the data mesh concept aims to decentralize data products rather than relying on a central data source.
That was a nice summary at the end. Thank you and have a nice rest of the week. Take care, ciao!