[00:00:00] Welcome to another Expert Data Summit presentation. This one's by Martin Gardner. His title is Solution Principal, and he works for an international consultancy called Slalom. He works in the Salesforce part of that consultancy, though they have other capabilities that he'll tell you about.
[00:00:32] And his presentation today is about a pattern for data migration. It's more about all the things that go into a data migration. And I've looked through it, and I think it's really interesting and really important. And it's something that people miss when they concentrate on just the technical aspects of data migration.
[00:00:49] So without further ado, here's Martin who we're going to learn from today.
[00:00:55] Well, thank you very much. Yes. My name is Martin, and I'm here today [00:01:00] to talk about a data migration pattern that I've developed over a number of years. I am a solution principal at Slalom. We're a large consultancy. I work in the London office in the Salesforce practice. We also work with AWS, Tableau, Snowflake, and have a large operation that builds things from the ground up ourselves. As you may be aware, if you've done data migrations previously, they are quite difficult. They're more than just a technical exercise. There are lots of things you need to think about when you're going to be doing a data migration.
[00:01:39] This session is really going to go through all the things that you need to remember or consider when you're building out a data migration project. So how can we possibly remember all these things? Well, if we think of them as a pyramid, that works quite nicely. So we have [00:02:00] four areas that we're going to look at.
[00:02:03] We're going to have data, methods, technology, and people. And I'm going to address each one of these in turn and break them down and show you what goes into each of them, and what things you need to think about, remember, or take action on in order to make your data migration successful. So, let's look at the first thing, which is data.
[00:02:30] How can we ensure, firstly, that the data we're using in our process is of good quality? We're not going to be able to say that we've successfully migrated our data if it arrives in the target system completely scrambled or messed up in any way. So we need to decide how to measure that quality.
[00:02:55] What rules or KPIs are we going to put in place at the [00:03:00] entry point of the migration that will drive out an understanding of the quality of that data as it comes in, and which we can then use at the other end to verify that we've not made it any worse?
[00:03:17] It's quite tempting in a data migration to cleanse the data, to clean it up as you bring it across, and that sometimes involves merging records. So if you're looking for duplicates and merging them, that's quite tricky to then test on. So if you're going to do that, I recommend you separate that out into a separate project or a separate phase.
[00:03:41] So you do your deduplication and merging before you attempt to migrate, because one of the really easy ways to check you migrated the right thing is to count the number of records in the source and count them in the target. If you've merged them in the middle of your process, you can't then rely on [00:04:00] that as a metric for telling whether or not you're off by a few records or a few thousand records.
[00:04:05] So really you need to be careful about doing that kind of data quality work during a data migration. So I would suggest placing some gates at the beginning and the end of your process that ensure a certain level of quality for your data, and then use that as a means of measuring how the process is going.
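To make that concrete, here is a minimal sketch of the kind of record-count check those gates rely on. It isn't from the talk itself; the table names follow the staging prefixes described later in the session, and the sample rows are invented.

import sqlite3

# Self-contained sketch: in a real project these tables would be populated by the
# get and put jobs in the staging database, not inserted by hand.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_account (external_id TEXT);
    CREATE TABLE out_account (external_id TEXT, salesforce_id TEXT);
    INSERT INTO src_account VALUES ('A-1'), ('A-2'), ('A-3');
    INSERT INTO out_account VALUES ('A-1', '001x01'), ('A-2', '001x02'), ('A-3', '001x03');
""")

def count_rows(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

extracted = count_rows("src_account")   # rows pulled out of the source by the get job
loaded = count_rows("out_account")      # rows confirmed as created in the target by the put job
status = "OK" if extracted == loaded else "INVESTIGATE"
print(f"account: {extracted} extracted, {loaded} loaded -> {status}")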
[00:04:28] Some of the things you might find in your source data that you might need to deal with are malformed email addresses, field lengths, data types, mandatory fields with null values, and broken referential integrity. With Salesforce-to-Salesforce migrations, you'll probably see fewer of the email address problems because Salesforce email fields have some validation already, and you're unlikely to find referential integrity problems
as long as you're building relationships using lookups and [00:05:00] master-details, and not storing IDs in text fields and things like that. However, if you're pulling from any other data source, SQL Server or a less managed system, then you may well find these things. And so you want to be looking out for those in your source data as part of your input
[00:05:17] KPIs.
[00:05:21] So when we've got an idea of what the quality is, we want to make sure that we don't let poor-quality data into the process. So we define our rules for entry, and then we screen out any records that don't meet those rules and pass them back to the client to resolve before they go through the process.
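As a rough sketch of what such an entry gate could look like, here is a small Python screening function. The field names, rules, and thresholds are purely illustrative; the real ones come from the data quality KPIs agreed with the client.

import re

# Hypothetical entry rules for a contact record.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
MAX_LAST_NAME_LEN = 80   # e.g. the length of the target field

def screen_record(record):
    """Return a list of rule violations; an empty list means the record may enter the pipeline."""
    errors = []
    if not record.get("LastName"):
        errors.append("missing mandatory LastName")
    elif len(record["LastName"]) > MAX_LAST_NAME_LEN:
        errors.append("LastName exceeds the target field length")
    if record.get("Email") and not EMAIL_RE.match(record["Email"]):
        errors.append("malformed Email")
    if not record.get("AccountExternalId"):
        errors.append("no parent account reference (broken referential integrity)")
    return errors

records = [
    {"LastName": "Smith", "Email": "a@example.com", "AccountExternalId": "A-1"},
    {"LastName": "", "Email": "not-an-email", "AccountExternalId": ""},
]
rejected = [(r, screen_record(r)) for r in records if screen_record(r)]
accepted = [r for r in records if not screen_record(r)]
# 'rejected' goes back to the client to resolve; 'accepted' continues into the migration.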
[00:05:43] That way, we ensure that when we get into the target system, the data there is of a known level of quality and can be accepted at UAT. Next up in the data area is location. This one is a more practical aspect: where is the [00:06:00] data? Is it in the cloud, or are we going to store it on-premises?
[00:06:04] Where's it going to be stored when it's being migrated? Is it going to be stored in memory? Is it going to be cached to a disk somewhere? And also, is it going to move across international borders? So are we migrating our data from China to the US, or from the US to Europe through China? What are the implications of that?
[00:06:30] Are there any legal implications to that that we need to consider? Next up is data volume. We really do need to have a clear understanding of how much data we're moving and how quickly we need to move it. That means we can then take that information and work out whether the tools we are going to choose
[00:06:51] will be able to support that volume of data. Now, if we do that upfront at the beginning of the project, we can tailor the tool choice to the volumes of [00:07:00] data we're moving. We can also check that the target system can support that volume of data. In the Salesforce world, there are several different ways of managing large amounts of data.
[00:07:13] So you may well want to consider whether it's appropriate to migrate that data into Salesforce, or whether to put it into a more cost-effective data store and then access it using external objects, or something like that. Finally, and most importantly in the data area, is security.
[00:07:34] So, for that, we need to know: where is the data going to be? Who's looking after it? Who's got access to that information? How are we going to keep control of it? Do we have any requirements on us under GDPR or other legislation to ensure it's secure? With the handling of the data, I recommend ensuring that [00:08:00] the client retains
[00:08:03] the data on their own systems. So it's usually best to have the client set up a virtual machine, and then we do the work on that virtual machine in their domain of control, so that they're not passing the data to us to process. That keeps things nice and simple. So here we are.
[00:08:23] These are the tips for this area. We want to define those quality rules for the data entering the migration pipeline. Don't cleanse the data during the migration. Do it either before or after. Estimate how long it's going to take for you to do the migration based on data volumes and the limitations of the platform you're using. Then, wherever possible, keep that data in the client's domain of control and responsibility so that you can keep the legal aspects simple.
[00:08:54] So moving on now to methods. The methods area is going to [00:09:00] cover all of the things that you might use to do the moving of the data. So things like: what processes are you going to use? How are you going to set things up? What methodology might you use for the project? And how might you test it, that sort of thing?
[00:09:15] So first of all, the contract. I'm approaching this from the point of view of a consultancy, but you could take this information and use it in an internal project, because really you are going to be setting up a contract between you and your users, or whoever the stakeholder is for this project.
[00:09:35] And you want to be very clear about what's going to happen in each phase. I recommend that you break the project down into two distinct phases: the discovery and design first, followed by the build, test, and execution after that. The reason for that is, and I've seen this on a lot of projects,
[00:09:55] is that it's very difficult when you first come into the environment, [00:10:00] looking at, you know, a spec from a customer, to really understand quite how many fields and how many objects there are, to really understand the complexity of what's needed to achieve the migration. So, rather than having to over-commit or under-commit before you go into the project, start off with a phase where you go and look at all the objects that you're going to be migrating and all of the data you can get your hands on,
[00:10:30] to get to a point where you've got a much more accurate estimate of how long and how expensive it will be to implement.
[00:10:41] So what should you put into that discovery and design phase? So you're going to be wanting to look at your data migration strategy, which is the who, how, when, where, and what/if of the migration. You're going to want to look at the data mapping. So [00:11:00] data mapping at an object and field level. What high-level transformations are you expecting to make?
[00:11:07] How many objects and fields are you going to be working with? And then if you take all of that, the build effort is going to be proportional to that object and field count.
[00:11:24] This slide here just shows a little bit more detail about what's in that data strategy document. So we're looking to cover how we are going to do the migration. Who's going to be responsible for what? What environments are we going to use? Do we need any servers set up, or any cloud services set up and initialized for us?
[00:11:49] What are we going to do if things go wrong? So we'll need a rollback strategy. How do we delete the data from the target environment? What processes are we going to use? Who's going to sign off on things? [00:12:00] A release schedule. So when are we going to do what? Who's going to do those activities?
[00:12:06] A definition of what the data quality rules will be for the project. And then finally things around how we'll test it, how we will log the processing of the data migration, and who's going to sign off on it, and how that will be reported.
[00:12:25] The data migration mapping document takes it a step further. This is well into the design part, so we're looking here at trying to describe everything that's relevant about the source and target systems: the objects and fields that we're going to be mapping, the data types. What we're really trying to do is make things as simple as possible for us to implement.
[00:12:49] So we're looking to identify potential problems around data formats, sizes of fields, data type mismatches, and complex transformations [00:13:00] that might need a staging database or some really custom coding to implement. And then we want to work out our execution order. With Salesforce, and many systems where you have relationships between objects, you have to be careful about the order in which you load them because,
[00:13:17] for example, a contact can't be loaded before the account that it is a child of, because it needs the account ID to load against, to be linked to it. Potentially, you could load everything without any parent IDs on it, and then run an update subsequently to add the IDs in. However, there are lots of objects in Salesforce for which that won't work, especially where they're master-detail.
[00:13:41] So you will need to plan the order in which you're going to load each object, and you might have to plan for revisiting objects to update them when you find loops in the dependency tree.
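One way to sketch that load-order planning is a topological sort over the object dependencies. The dependency map below is hypothetical; a real project would derive it from the data mapping document.

from graphlib import TopologicalSorter, CycleError  # standard library, Python 3.9+

# Each object maps to the set of objects that must be loaded before it.
depends_on = {
    "Account": set(),
    "Contact": {"Account"},
    "Opportunity": {"Account"},
    "OpportunityContactRole": {"Opportunity", "Contact"},
}
# A self- or mutual reference (e.g. Contact.ReportsTo pointing at Contact) creates a loop
# that needs a follow-up update pass rather than being part of the initial insert order:
# depends_on["Contact"].add("Contact")

try:
    load_order = list(TopologicalSorter(depends_on).static_order())
    print("Load order:", " -> ".join(load_order))
except CycleError as exc:
    print("Dependency loop found, plan a second update pass for:", exc.args[1])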
[00:13:59] So [00:14:00] how would we go about building this? I would recommend a classic ETL platform for a data migration. So we're going to extract the data from the source system, stage it somewhere, transform the data, and then load it into the target system. And the way I would break that down would be into five separate jobs.
[00:14:23] So, in order to make things easy to test, it's nice to be able to pull everything out of the source system into the staging database, stop, and then check that we pulled out the number of records we're expecting, that we haven't got any error messages, and that everything seems to have loaded correctly,
[00:14:44] and then we can move on to mapping. So, by breaking everything down into different steps, we isolate each step, we can test each step in isolation, and we can repeat each step without having to go all the way back to the beginning. So if we [00:15:00] get towards the end of our process, we can undo a certain number of steps and then redo them without having to clear down everything and start again, which, with a very complex migration, can save a lot of time and headaches.
[00:15:15] So, I break it down into these five jobs or processes: a get, a map, a put, a delete, and a run. The get pulls everything out of the source database into a staging database. The map part does all of the transformations to change the input schema into something that looks a lot like the target schema.
[00:15:40] The only thing that map won't be able to do is resolve foreign keys and references. So, for example, if we're loading an account into Salesforce, it will be given a Salesforce account ID that will be unique to that record in that org. And when we then try and load the contacts, we need them to be loaded with an [00:16:00] account ID that matches the right record for them.
[00:16:04] We have the information then in the staging database that tells us what the ID was for the account we were looking for. We'll use an external ID to find the right account, and then we use that account's ID to update the contact. That can only happen in the put phase of the process, where we've already put the accounts in and we're putting the contacts in.
[00:16:32] You can sometimes do that in the map, but doing so creates a dependency. That means that the account has to be got, mapped, and put before you can do the mapping of the contacts, which may not be what you want. So if you want to keep things vertically isolated, you ensure that you can run all of the get jobs, then all the maps, then all the puts, without having interdependencies between [00:17:00] a map and a put.
[00:17:03] So moving on to what the put does: that just loads into the target system. So it's going to take the record from our staging database, which looks a lot like the target that we're loading to, and just push it in. It might do the foreign key resolution to look up, based on external IDs, the parent records or lookup values that we're after,
[00:17:23] but that's pretty much it. And then it's going to report back on the process and the success of that. The final two processes here are helpers really. So you have a delete, which is your rollback for each particular object, and a run, which allows you to orchestrate the process for an entity or the whole thing.
[00:17:44] And you can be as granular as you want. On the right here, we've got an example of a run job where it's chaining together the get, the map, the put, and then the delete. You would use it in this way when you're debugging or testing: you step [00:18:00] through and execute the process,
[00:18:02] run it through, check that the data's there, and then remove the data to leave it clean for another run. So here is an example of a get process. We have a query for the Salesforce data, and then we're going to stage that directly into a database table. At this point, we're not going to change the schema.
[00:18:24] We're going to create a replica table in a SQL database that exactly reflects what we pulled out of the source system. We can then use that to verify that we got the right data, and we can also go back afterwards. So if we get to the end of the process and the client checks the data and says, "Oh, there's a problem in this field, the value is wrong," we can trace that back
[00:18:48] all the way through our staging tables back to the source record that we pulled in, and verify either that the data is correct and has been migrated correctly, or [00:19:00] we can identify where in the process that field was transformed incorrectly. This would be a typical map process, so we are now loading from that staging table, pulling the records out, and performing some transformations.
[00:19:16] So recoding account references, the industry field, picklists that need to be modified or transformed: this is the point in the process where we do that. And the output is a mapped table in our staging database that represents as closely as possible the target table structure.
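As a small illustration of a map step like that, here is a tiny Python transformation. The field and picklist values are invented for the example; the real recoding rules would come from the mapping document.

# Recode a source row into the shape of the target account table (illustrative names only).
INDUSTRY_RECODE = {
    "Fin Svcs": "Financial Services",
    "Tech": "Technology",
}

def map_account(src_row):
    """Turn one row of the source account staging table into a target-shaped row."""
    return {
        "External_Id__c": src_row["legacy_id"],   # carried through so the put can resolve lookups
        "Name": src_row["account_name"].strip(),
        "Industry": INDUSTRY_RECODE.get(src_row["industry"], "Other"),
    }

print(map_account({"legacy_id": "A-1", "account_name": " Acme Ltd ", "industry": "Fin Svcs"}))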
[00:19:39] Here we have a put process, which again is quite simple. It's just reading from that database table and pushing that into our target system.
[00:19:54] So as a result of running that put process, we'll have a table with [00:20:00] some IDs in it for the newly created records, so we can take that and push it into Salesforce through the delete operation to remove them. And here we have the orchestration again of all those four processes into one.
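Pulling those pieces together, a skeleton of the five jobs might look like the sketch below. The function bodies are placeholders standing in for the ETL tool or API calls, not a working loader.

def get(entity):
    """Extract from the source system into a source staging table, schema unchanged."""
    print(f"get: staging source {entity} records")

def map_(entity):
    """Transform the source table into a target-shaped table
    (everything except foreign-key resolution)."""
    print(f"map: transforming {entity}")

def put(entity):
    """Load the target-shaped table into the target org, resolving lookups by external ID,
    and record the new IDs in an output table."""
    print(f"put: loading {entity}")

def delete(entity):
    """Rollback helper: remove the records listed in the output table from the target."""
    print(f"delete: rolling back {entity}")

def run(entity, with_rollback=False):
    """Orchestrate the steps for one entity; chaining the delete on the end is the
    debugging and dry-run pattern described above."""
    get(entity)
    map_(entity)
    put(entity)
    if with_rollback:
        delete(entity)

run("account", with_rollback=True)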
[00:20:17] So you may ask, why would I suggest using a database for staging the data? And in that case, what kind of things should I do with it? Well, I find a database quite useful, especially with complex data migrations, because it gives me somewhere to stash things and come back to them. So I can cache the data I pulled out so I don't have to go back to the source system and pull it out again.
[00:20:43] In the past, I've had some systems where it's quite slow to pull the data around. So you want to minimize the number of times you're pulling that data out from your source system in order to speed up your end-to-end process. So how would you set it up? Well, keep the logging separate from the [00:21:00] actual migration data.
[00:21:01] In SQL Server, that would be two different schemas. And use a prefix for the tables so that you can follow the entities through the system. So you've got a source or SRC account table; that'd be the source data that you pulled in from the source tables. A target or TGT account table would be the transformed data, so the map process will build the TGT table.
[00:21:22] And then the put process will generate the OUT account data, which is a record of what data was actually loaded into the system. Any reference data should be prefixed with REF. So, for example, if you're mapping users from one system to another, you need some reference user tables. Using those prefixes keeps it clear: the purpose of each table is clearly understandable, and you don't end up accidentally loading the wrong thing.
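A small helper can keep that naming convention consistent across jobs; the schema name here is an assumption.

PREFIXES = {
    "src": "raw extract from the source system",
    "tgt": "transformed data shaped like the target schema",
    "out": "record of what was actually loaded, including the new target IDs",
    "ref": "reference data, e.g. user mappings between systems",
}

def staging_table(kind, entity, schema="migration"):
    """Build a fully qualified staging table name, e.g. migration.src_account."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown table kind: {kind}")
    return f"{schema}.{kind}_{entity.lower()}"

print(staging_table("src", "Account"))   # migration.src_account
print(staging_table("out", "Contact"))   # migration.out_contact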
[00:21:42] So what do we do when we're testing the data? This is the final area for methods. We want to make sure that our test data matches the [00:22:00] production data quality and, ideally, to test with production data.
[00:22:03] That can be troublesome, so you may well have to implement procedures that control access to that system when you're testing, to ensure that you abide by the appropriate regulations. Use the database as an audit log and proof for your testing. So build those KPIs into the audit
[00:22:25] tracking of your process, so that it's easy to just run a SQL query, you know, run a report over the execution and say, this is the result, this is what happened, and you're not having to trawl through log files trying to work it out.
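For example, a minimal audit table along these lines lets a single query summarise a whole run; the column set and names are assumptions, not a prescribed schema.

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")   # stand-in for the logging schema in the real staging database
conn.execute("""
    CREATE TABLE log_job_run (
        job TEXT, entity TEXT, started_at TEXT, ended_at TEXT,
        rows_in INTEGER, rows_out INTEGER, errors INTEGER
    )
""")

def log_run(job, entity, started_at, rows_in, rows_out, errors):
    """Record one job execution so the run can be reported on with a single query."""
    conn.execute(
        "INSERT INTO log_job_run VALUES (?, ?, ?, ?, ?, ?, ?)",
        (job, entity, started_at.isoformat(),
         datetime.now(timezone.utc).isoformat(), rows_in, rows_out, errors),
    )

log_run("put", "account", datetime.now(timezone.utc), 1500, 1498, 2)

# One query then produces the sign-off report instead of trawling through log files.
for row in conn.execute(
    "SELECT entity, SUM(rows_in), SUM(rows_out), SUM(errors) FROM log_job_run GROUP BY entity"
):
    print(row)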
[00:22:41] This is a typical process for testing a data migration. You're going to have a unit testing phase where you're looking at each particular job in isolation, making sure that it works on its own. Then you go into some end-to-end testing where you're building the whole thing together and starting to practice and get an idea of how [00:23:00] long it's going to take with representative data volumes.
[00:23:04] Your UAT phase is where you get the users back in and get them to check that it looks right. You want to be minimizing the number of changes and fixes you're doing in the process. And you really do want to be testing on representative, if not live, data. Then you want to do some dry runs; three or four dry runs is probably best,
[00:23:24] so that you can iron out all the problems before you hit them. Otherwise, you'll be up until midnight when they go live on the weekend, trying to fix some data problems that you could have found several weeks ago. So that's it for methods. Separating out the discovery from the delivery gives you an opportunity to re-estimate everything and agree the validation method and testing responsibilities early.
[00:23:51] Be really clear about who's taking the risk for the migration, especially if you're contracting using time and materials or fixed price. [00:24:00] So, technology then. For this area, we need to understand the source and target of the data migration. Are we coming from Salesforce to Salesforce, SQL Server over to Salesforce, or Dynamics CRM?
[00:24:16] What systems are we dealing with? We need to know a lot of information about that and understand the limitations that we'll be working under. For example, in Salesforce, will we be able to turn off triggers and automation when we load in order to improve the performance? If we do that, we may have to ensure that we can either
[00:24:37] trigger the behavior that those triggers would have run after migration, or incorporate whatever mappings or coded changes they would have made into our process itself. So, there is a plethora of tools that we can use, and we probably want to consider the data volumes when choosing those tools
[00:24:59] [00:25:00] alongside that. So you can use Excel and Excel plugins to load things and move small volumes in. The Salesforce Import Wizard can do up to about 50,000 records, and I imagine those would need to be pretty simple records: a single object with no other references. With the Salesforce Data Loader, you can get a bit more complicated.
[00:25:20] But again, it gets a bit painful with large volumes of data because it's very difficult to validate huge text files. And then you can use more commercial products, which can make it a lot easier to load very large data volumes and do a lot more complex transformations. The final area in technology is limits.
[00:25:41] So with Salesforce, in particular, there are some limits that you need to look out for. If you're using the Bulk API, for example, you can only do 10,000 batches in a rolling 24-hour period. And I have had it on some projects where we've had to reduce the batch size down to between one and ten records [00:26:00] for many thousands of records,
[00:26:02] and that can cause you to go over that limit. In that case, if you were hitting that sort of thing where you can only load one record at a time because of various triggers that you can't turn off in Salesforce, you're going to have to switch from using the bulk API to using the SOAP API because that doesn't have the same limit.
[00:26:19] It will not run as fast, but then, given that you're writing one or two records at a time, you're not really gaining anything from using Bulk anyway. One of the other limits that will catch people out is the 200,000 content versions that can be published in a 24-hour period. So if you have more than 200,000 files that you want to load into your Salesforce system, you're not going to be able to load them and get them published within 24 hours.
[00:26:56] So you will probably need to plan for that [00:27:00] taking longer than 24 hours.
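As a quick illustration of that limit planning, the arithmetic below uses the 10,000-batches-per-day figure quoted in the talk; the record volume and batch sizes are made-up examples.

records = 500_000           # illustrative volume, not a real project figure
daily_batch_limit = 10_000  # Bulk API batches per rolling 24 hours, as quoted in the talk

for batch_size in (2000, 200, 10, 1):
    batches = -(-records // batch_size)   # ceiling division
    note = " -> over the daily Bulk limit, consider the SOAP API" if batches > daily_batch_limit else ""
    print(f"batch size {batch_size:>4}: {batches:>9,} batches{note}")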
[00:27:04] So in summary then, for the technology area: find all your source systems, select tools you already know how to use, and then check the limits that you're going to be operating under.
[00:27:20] And now, people. So, there are many people involved in making data migrations work. The first one I'm going to talk about here is the data owner. They're the person who understands the significance of the data migration. They're really the owner of the system that's being migrated from or to, and they'll be accepting the outcome of UAT and the dry run testing.
[00:27:45] So quite an important person. They will need to take ownership of everything, or you'll have to have multiple data owners, but ideally a single person takes responsibility for every object you're migrating. This would be someone on the client side, [00:28:00] perhaps the sponsor for the project.
[00:28:06] The data owner then probably would need to nominate a subject matter expert, someone in their team who really understands where all the skeletons are hidden in the data. They know the quality, and they can help us to answer the questions that will come up around what should be mapped.
[00:28:23] Why is the data like it is? Where should this go? How's it being used? So they will be critical to the project. The dev team then is going to need either one person in each of these roles or, you know, you could share the roles somehow, but you need these skills covered by the people on your team.
[00:28:44] We'll need a project manager to manage the interaction with the client, track milestones, report progress, keep things aligned, and help with prioritization. You'll need a data migration architect who will be able to draft the migration strategy and lead [00:29:00] the ETL developers. Then one or more ETL developers to design and build the ETL processes according to the migration strategy.
[00:29:10] Optionally, a DBA, or access to one, for your staging database to ensure that it's properly tuned and works correctly. And then a really good tester who will be able to validate that the migrated data and process follow the design and are acceptable.
[00:29:32] On top of the pyramid here for people is the skills area. So we need to have a team of people who really understand ETL technology, especially in the dev team. And it's good if the data owner or the data SME can understand that as well. So we'll need to tell them about how ETL processes work, what unique keys are, and
[00:29:57] what's going to happen when we do that sort of [00:30:00] stuff. The data SMEs are going to need a little bit more knowledge of data and how it's moved around, because we're going to need to talk to them about it at that level. So they'll need to know about foreign keys and unique keys and entity-relationship diagrams and field metadata.
[00:30:20] Now, the dev team, they really need to be expert in ETL processes, and ideally expert in the Salesforce APIs as well, so that they can avoid the Salesforce gotchas that otherwise trip people up. And it's best if you pick a set of tools that matches the team, or a team that matches the tools that you want to use. It's really not a good idea to start
[00:30:44] a new migration with a completely unfamiliar toolset, because that will blow out the amount of time it's going to take.
[00:30:54] So, summing up in the people area: keep clear ownership of the data. [00:31:00] It's got to be accepted by the business or by the data owner at the end of the project. And access to the SME, the data SME, is critical. It can really slow a project up if it's difficult to get hold of someone who can answer the questions about
[00:31:14] the data being migrated. And then, as I mentioned earlier, pick a team that knows the tools you're going to use, or pick tools that you already know, and that should ensure success. So, in summary, data migrations are difficult, but there are four things, four key areas we can think of that help us to be successful, and they are data, methods, technology, and people.
[00:31:47] Thank you very much. Any questions? Thank you, Martin. Yeah. I'm going to ask a couple of questions that I hope maybe the audience would have asked. The first one is, understanding how important documentation is and agreeing with [00:32:00] you a hundred percent on that, do you have any tools that you like to use to document the Salesforce data before it's migrated,
[00:32:06] or is it mostly homegrown, you know, Excel sheets or Word documents or whatever? So mostly, it ends up being Excel spreadsheets because they're portable. You can hand them over to someone, and they can use them. I do have a few tools that I've built in Python to extract the Salesforce metadata easily into tables, which can then be used.
[00:32:29] On one project I did write some stuff in a Google Spreadsheet as well to do some checking across tabs, so you'd have a tab for objects, fields, and the mappings, and then try to validate that the data types you're mapping were correct. But it's quite difficult to communicate complicated mappings, whichever way you do it.
[00:32:48] So it's, yeah, it's tricky sometimes. Yeah. So no silver bullet yet for that, huh? No, not yet. On the UAT, [00:33:00] when I've been involved in projects like this, it seems like UAT is a place where things can fall down, where things can slip through the cracks. You know, despite the best efforts of the development team, they're still going to miss things because they're not subject matter experts, and sometimes the subject matter experts aren't as engaged as they should be.
[00:33:18] Do you try to put the UAT criteria or a testing plan into the contract, or have it written down somewhere? How much formalizing do you think is right, that kind of stuff? I think it's kind of difficult to build a lot into the contract for, you know, the scope of the migration. I think one of the key things is to be clear about where the responsibility lies.
[00:33:46] And I think the UAT should be done by the client and led by the client, because they've got to accept the data when it comes out of the process. And I think one of the things to do to make it easier [00:34:00] at the beginning of the project is to identify what the KPIs are around this data, what really matters about it.
[00:34:08] So if it's a financial institution, that's usually money. So we might, for example, sum up the amount of assets under management and make sure that we're within a certain tolerance of that when we've migrated the data, and try wherever possible to automate the testing of the migration down to a few simple reports or queries to get a good feel for whether the data is correct or wrong,
[00:34:34] and follow that up with some targeted record-by-record eyeballing of particular potentially problematic records. So that's where doing some data quality profiling at the beginning is useful, because that can throw up the curveballs in the source data that might cause problems during the transformation and migration.
[00:34:55] So that allows you to then go, oh, okay, we've got a particular scenario, we know it's going to be [00:35:00] problematic, we'll create a test case for that, and we'll make sure we have test data that exercises that and look at it in detail. So you might do that for, you know, areas where you think there's going to be a difficult transformation, but also
[00:35:12] important customers. For example, you might have some very high-profile customers where you want to make sure that their data's correct, and some less important customers you're not so worried about in your data set. So you would prioritize testing their data and making sure it looks right and appears correct.
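As an aside, the kind of automated tolerance check described a moment ago might look like this; the figures and the tolerance are invented and would be agreed with the data owner.

# Compare a business KPI between source and migrated data, within an agreed tolerance.
source_total_aum = 1_254_300_000.00   # e.g. the sum of assets under management in the source extract
target_total_aum = 1_254_299_050.00   # the same sum reported from the migrated data
tolerance = 0.0001                    # 0.01% relative tolerance agreed with the data owner

relative_diff = abs(source_total_aum - target_total_aum) / source_total_aum
print("PASS" if relative_diff <= tolerance else "FAIL",
      f"(relative difference {relative_diff:.6%})")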
[00:35:31] So smart. Since you're using a relational database as sort of your middle layer, your cache, for want of a better term, is there one that you like better or think is better suited for this task, or is that more down to the customer? I've built it in MySQL on a laptop because it was free.
[00:35:52] But for a client, I've done the same thing on Microsoft SQL Server. There are some slight differences in the way [00:36:00] that SQL Server and MySQL work, which change what you can do. But yes, generally, I don't have much of a preference. It's really only about having somewhere to store the data and run some reports on it.
[00:36:14] Yeah. Yeah. Well, I think it's a fascinating and often overlooked part of the whole integration process, and I really appreciate your time today and your expertise. Thank you, Martin. Thank you very much. You're welcome.