Join Achim Koch, Principal Technical Architect at Adobe, to learn how to migrate a legacy DAM to Adobe Experience Manager Assets. Gain insights into stakeholder analysis, resource planning, data transformation, and best practices like using CSV files for data handling. Build a roadmap for your own Adobe Experience Manager migration projects.
Hello and welcome to this session on how to migrate a legacy DAM system to AEM Assets. We are often asked the question, is there an out-of-the-box tool that can migrate from system XYZ to AEM Assets? The short answer is no. There are probably hundreds of different products and custom-built solutions out there. It would not be economical for Adobe to invest in tools that are only used once or twice. But there is not nothing. A migration does not have to start from scratch. There are plenty of proven practices and tools that can support you in your own migration project.
My name is Achim Koch. I’m a principal technical architect with Adobe. I have been working in the web and content management industry for some 25 years now, almost 20 of which with Adobe Experience Manager, or CQ, or Communiqué as it was called in its early days. Fun fact, my first project was a CQ2 to CQ3 migration. Since then, I have done countless AEM projects. In the beginning, these were mostly Sites projects. But recently I see a trend that more and more of our customers migrate to AEM Assets. And I have been involved in quite a few migration projects myself in the past years. Here I want to share some common patterns I have observed and the lessons I have learned. If time permits, I also like to blog about my experience. This session was inspired by a longer article series I have written this year. So if you want to go into more detail, you can find and follow me on medium.com.
From a high-level view, a migration project consists of five stages. Classic project planning, implementation planning, the AEM implementation and the migration script implementation. At some point in time, you will gain enough confidence in your migration scripts. Then you would run the migration scripts against your new AEM production system. This is then called the migration execution. Of course, these stages are not executed one after the other.
Usually they overlap, run in parallel or are even executed in iterations. I believe you have done some migration projects yourself and you might recognize these stages. Even though these stages might be executed in parallel, it’s still worth conceptualizing them as separate stages. This helps you figure out responsibilities and dependencies. Last but not least, it massively helps in agreeing on a common terminology. Ask three people what migration means. The first will say the whole end-to-end process. The second will refer to writing the migration scripts and the last to the actual content migration to production. This depends on their respective background and role in the project. You will be surprised how many roles can be involved in a simple migration project.
This brings us to the first stage, project planning. Project planning for a migration project is no different from planning any other IT project. You will analyze who the stakeholders are, what resources are required and in what timeline the project should be delivered. In my personal experience, the stakeholder analysis is often underrated. You will be surprised how many people are involved in such a migration project. The execution of a migration is a purely technical procedure. But to reach that point, a larger number of people need to be involved. Let me give you a few examples. Note that this is not a comprehensive list but should inspire you to ask the proper questions to gather your stakeholders.
The sponsor is the person with the budget and has the last word in any decision. If all goes well, there will be only a few touch points. In my experience, however, migration projects often come with tough decisions and compromises. So it’s a good idea to keep the sponsor in the loop and well-informed. Business users, or better, power users, are the people who are working with the current legacy system and will be working with AEM in the future on a day-to-day basis.
They need to be involved in the requirement analysis of the new system and are most likely the go-to people who can help you understand the legacy system. Remember, there is a reason why you or your customers want to migrate off the legacy system. If that system was well maintained and well documented, there probably would not be a reason to migrate at all. So be prepared to have long and frequent deep-dive sessions with business users. In the planning phase, we must get a strong commitment that these power users are available throughout the whole project. But also be prepared that they will not be able to answer all questions for you. They are just users of an as-is system and have rarely been involved in the decisions that were made when the system was first set up.
For the migration, you must get some sort of backend access to the legacy system to extract the binaries and metadata. Identify who technically owns and supports that system and make sure they are available throughout the whole project. Usually, the preparation of the execution takes several months and several iterations. A simple data dump you make at the beginning of the project is probably out of date when you execute the migration. Also, a good legacy support team is able to fill in the knowledge gaps a business user might have. If you are a project manager, make sure to have some extra budget set aside for legacy support, as the normal support contract might not cover project work.
The last of my examples involves your or your customer’s IT system administrators. When the legacy system is provided on-prem, they help you gain access. In any case, the blueprint I’ll share later uses some cloud infrastructure as intermediate storage. Somebody must be able to provision that cloud storage for the project. Make sure to document the roles you have identified in the analysis. What are their respective responsibilities, what knowledge can be expected and what decisions can these roles make?
A RACI matrix is a good start for such documentation. Resource planning is super important. It is not sufficient to name and assign the roles. It must also be ensured that the people assigned have enough time set aside from their day-to-day job. Related to the resource planning is the timeline planning. When are which human and technical resources available? Keep in mind that people have holidays every now and then and typically are not available when you need them most. Also, there are times when the day-to-day business is more demanding than at other times and availability might be lower. For example, in retail, the Christmas season usually is quite busy and slows down projects. Last but not least, for planning of the migration execution, identify off-limit windows. Again, a retail company would probably not want to have a live migration during their peak business times. Also, keep in mind that there could be a hard deadline, for example, when the license contract of the legacy system expires. All this is not planned in a vacuum but needs to be closely aligned with the technical planning. For example, how do you want to migrate? By continuously synchronizing, running the legacy and the AEM system in parallel for some time, or in a big bang fashion? Plus, check how long the execution takes and how business continuity can be ensured during that phase.
This brings us to the technical part of the planning. This can roughly be divided into three disciplines. Requirements analysis for the new system, how data is transformed from legacy to AEM, and what infrastructure is available and required for the migration. Requirements analysis for a migration is a bit more complex compared to a greenfield project where you start from scratch. Requirements can be a bit more fuzzy, like "same features as the old system" or "all formats need to be supported". Ah yes, and there are a couple of downstream systems attached to that legacy system, so AEM is expected to be fully backwards compatible. Such requirements are a good start for a failed project. That is why it is super important to be diligent here. Users expect the new system to be better, but AEM most certainly is different. So that difference needs to be defined to manage the users’ expectations. For example, what does the AEM UI look and feel like, and exactly which formats and transformations are supported by the new system?
If we know what the new system must deliver, the hard part begins. We need to figure out how the data is transformed from the legacy system to AEM. Copying the binary data is trivial. The hard part is mapping the metadata from legacy to AEM. Individual asset metadata, but also structural data like hierarchies and taxonomies. In one of my last projects I learned that some DAM systems store assets in a flat table and make them accessible by browsing through one or more filter taxonomies. AEM, in contrast, supports browsing only in a folder structure. You can use any number of tags in search, but not for browsing. This makes mapping the user experience from the legacy system to AEM virtually impossible. The experience needs to be redefined. In hindsight, I would have done a feasibility study to set expectations properly. If you’re interested in more details, that is described in the last part of the article series I mentioned earlier.
A migration is also a good opportunity to get rid of data that is no longer in use or relevant. It’s a good idea to incorporate a clean-up procedure in the technical migration. There is no point in migrating 10 terabytes of data and having business users delete half of it after the migration. Clean-up can happen either on the legacy system before the migration, or during the migration by means of a delete list or delete rules fed into the transformation. For example, a rule could be: do not migrate when the creation date is older than X years, except when the asset is in folder Y.
Don’t forget to plan and analyze the infrastructure. What bandwidth and protocols are available on the legacy system to extract the data and how long does it take to extract the data? Also, measure ahead of time how long it takes to upload the data to AEM.
With all this planning, we can now start the implementation. There are two parts. First, the implementation in AEM, like metadata schemas, processing profiles, UI customizations, workflows and other content optimization.
And second, there is the implementation of the migration procedures, which will do the extraction and transformation. At this point, we need access to the infrastructure to extract data from the legacy system and to store it on an intermediate cloud storage. More on that in a bit. At this point, we should also know which scripting language we want to use and what other tools will help. As this session is about migration, I will focus on this topic. The AEM implementation is a topic for a different time. Keep in mind, though, that AEM and migration implementation go hand in hand. You need to transform the data into the format that AEM expects. And AEM can only expect data that is actually available, clean and transformable. Implementing the actual migration scripts is a highly iterative process. You will execute them several times, analyze the results, gain feedback from business users, adapt, and repeat the process multiple times.
Let’s see how I usually set up the technical migration. It will look quite old school at first, using CSV files and batch jobs. But trust me, there is a good reason. I have read somewhere that most banks process their data in batch jobs and not online for good reasons: A, for performance, and B, to control the quality. In a migration, we have these same nonfunctional requirements.
OK, here is my migration blueprint. I am assuming we want to migrate to AEM as a Cloud Service. I use a mix of individual scripts for extraction and transformation, each of which communicates with the next via CSV files. I do not implement fancy inter-process communication. A script loads one file, transforms it and stores another file. That file can then be used by the next script. Bear with me, why this all makes sense will become clear in a couple of minutes. To give you a rough orientation, the red bubbles or boxes are tools that you can use out of the box. The gray boxes represent scripts that you need to implement per project.
And the blue bubble is the interface of the cloud storage we will use. Let’s start at the top right corner with the legacy DAM. We assume that the system has a way to export the data to a cloud storage provider. This can be either AWS S3 or Azure Blob Storage. Most legacy systems I have encountered do have such an interface. They could either upload directly to the cloud or use the SFTP protocol, which is also supported by AWS and Azure. In another case, the legacy DAM stored the assets in a Windows file system. We could use a desktop tool from Microsoft to upload these files to Azure. Bottom line is, in the past, there always has been a way. If you need to write your own transfer script, keep in mind it must run somewhere in a data center with proper bandwidth. You can’t use the option to download 10 terabytes of assets to your home office.
Now the binary data is in the cloud storage. We then extract the asset metadata, folder structure and taxonomy data into discrete files and transform them into CSV files. Often legacy DAM systems provide such exports as XML or JSON. I prefer to clean the data up and transform it into CSV files. It is much easier to read or scan through thousands of lines in a spreadsheet than to interpret the corresponding structured files. At this point, the transformation is only a transformation of the format. The information is the same, maybe reduced by what we do not need to consider.
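As a rough illustration of such a format-only transformation, a minimal Node.js sketch could look like the following. The shape of the legacy export, the field names and the CSV column names are all made up for illustration; a real export will look different, and a real project might prefer an existing CSV library over the small hand-written quoting helper used here.

```javascript
// Hypothetical sample of a legacy DAM metadata export (structure is an assumption).
const legacyExport = [
  { id: "A-1001", name: "logo.png", path: "/brand/logos",
    meta: { title: "Logo, primary", created: "2015-03-01" } },
  { id: "A-1002", name: "hero.jpg", path: "/campaigns/2023",
    meta: { title: 'Hero "Summer" shot', created: "2023-06-15" } },
];

// Quote a value in the usual CSV style: wrap it in double quotes when it
// contains the delimiter, a double quote or a line break; double inner quotes.
function csvField(value, delimiter) {
  const text = value === undefined || value === null ? "" : String(value);
  const needsQuotes = text.includes(delimiter) || /["\r\n]/.test(text);
  return needsQuotes ? `"${text.replace(/"/g, '""')}"` : text;
}

// Flatten the rows into CSV text, one column per entry in `columns`.
function toCsv(rows, columns, delimiter = ",") {
  const header = columns.map((c) => csvField(c.header, delimiter)).join(delimiter);
  const lines = rows.map((row) =>
    columns.map((c) => csvField(c.pick(row), delimiter)).join(delimiter)
  );
  return [header, ...lines].join("\n") + "\n";
}

// Only the fields we care about are carried over; everything else is dropped.
const columns = [
  { header: "legacyId", pick: (a) => a.id },
  { header: "fileName", pick: (a) => a.name },
  { header: "legacyPath", pick: (a) => a.path },
  { header: "title", pick: (a) => a.meta.title },
  { header: "created", pick: (a) => a.meta.created },
];

const csv = toCsv(legacyExport, columns);
console.log(csv);
```

The nested structure is flattened into one row per asset, and no business rules are applied yet. Keeping the legacy ID as its own column makes it possible to cross-reference rows in all later steps.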
The CSV files are then ingested by another script, or better, a set of scripts, that maps the data according to defined business rules. Here we can also incorporate business input such as metadata mapping files or a list of folders to be deleted. The mapper scripts generate three artifacts. A folder mapping file for building or restructuring the folder hierarchy for AEM, a taxonomy file containing all tags required in AEM, and an asset metadata file. The folder mapping file is the input for the binary restructure script. The idea here is that we assume that the folder structure on the legacy system is not the same as we require in AEM. Either because there are business rules in place to restructure or because the old structure is simply flat and we need a folder hierarchy that is derived from the asset metadata. Here is the trick. The restructure script calls the respective cloud storage API to copy the binary data from the legacy structure into the required and mapped new structure. Yes, copy, not move. We might want to repeat that process and thus do not want to destroy the old structure. Cloud storage APIs allow copy operations without downloading and re-uploading, and the execution usually takes only a couple of minutes. The duration is determined by the number of files, not their size.
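The core of such a restructure script can be sketched as follows, under a number of assumptions: the folder mapping file is a plain two-column, semicolon-separated CSV without quoted fields, the object keys and folder names are invented, and the actual storage call is injected as a function so the sketch runs without any cloud account. With the AWS SDK for JavaScript v3, for instance, that function could wrap `CopyObjectCommand` from `@aws-sdk/client-s3`; that wiring is deliberately left out here.

```javascript
// Hypothetical folder mapping file as produced by a mapper script.
const mappingCsv = [
  "legacyFolder;aemFolder",
  "export/flat;migrated/products",
  "export/flat/logos;migrated/brand/logos",
].join("\n");

// Hypothetical listing of object keys in the intermediate cloud storage.
const objectKeys = [
  "export/flat/shoe.jpg",
  "export/flat/logos/logo.png",
  "export/other/readme.txt",
];

// Parse the two-column mapping CSV (no quoted fields, an assumption of this sketch).
function parseFolderMapping(csvText) {
  const [, ...rows] = csvText.trim().split(/\r?\n/); // skip the header row
  return rows.map((line) => {
    const [legacyFolder, aemFolder] = line.split(";");
    return { legacyFolder, aemFolder };
  });
}

// Build the list of copy operations without touching the storage yet.
function buildCopyPlan(keys, mapping) {
  const plan = [];
  for (const key of keys) {
    // The longest matching legacy folder wins, so nested folders override parents.
    const match = mapping
      .filter((m) => key.startsWith(m.legacyFolder + "/"))
      .sort((a, b) => b.legacyFolder.length - a.legacyFolder.length)[0];
    if (!match) continue; // unmapped objects are simply not copied
    plan.push({ from: key, to: match.aemFolder + key.slice(match.legacyFolder.length) });
  }
  return plan;
}

// Execute the plan with an injected copy function. Copy, never move:
// the source objects stay in place, so the step can be repeated.
async function runCopyPlan(plan, copyObject) {
  for (const { from, to } of plan) {
    await copyObject(from, to);
  }
  return plan.length;
}

const copyPlan = buildCopyPlan(objectKeys, parseFolderMapping(mappingCsv));

// Dry run: only log what a real server-side copy would do.
const dryRunCopy = async (from, to) => console.log(`copy ${from} -> ${to}`);
runCopyPlan(copyPlan, dryRunCopy).then((count) => console.log(`${count} objects planned`));
```

Separating the plan from its execution is a design choice of this sketch: the plan can be written to yet another CSV file and reviewed before a single object is copied.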
The new structure can then be imported into an AEM as a Cloud Service instance using the bulk importer. The bulk importer is a tool that is built into AEM as a Cloud Service. The import is a cloud-to-cloud process and accordingly fast.
The second artifact the mapper produces is a taxonomy CSV file. This CSV file can be directly transformed into a tag structure under /content/cq:tags using the ACS Commons Tag Maker. A little downside here, the latest version of Tag Maker only supports Excel files. I usually install an older version of ACS Commons on a local instance and create a content package from there that I can then import via Package Manager into the cloud instance. There are some libraries on the market that could create Excel files directly, but I never had a chance to test them, mostly due to time constraints. And I have never found it worthwhile learning those for a one-off task, but this should of course not keep you from using this path.
The third artifact created is a CSV file that represents all mapped asset metadata. This can be directly ingested into AEM using the built-in metadata importer. You might want to do an export first to figure out what format the CSV file should have.
On the last slide I painted the mapper as one script, but in reality this is often a set of individual scripts that again read and write individual CSV files. The point of having individual scripts that create individual files is that it allows us to fine-tune and repeat each individual step without having to repeat all transformations that happened before. Working with local CSV files is much faster than having to download live data on each run. It also allows us to analyze intermediate results better than we could with a debugger and log files. In my experience, it is quite easy to spot regularities and irregularities in the data when the CSV data is imported into Excel. The Excel files can also be shared with business users for QA and approval. You can’t expect the business users to click on thousands of assets in AEM after the migration for QA.
Also, most scripting languages support CSV files, and processing those files is an order of magnitude faster than processing corresponding JSON or XML files, which allows running and re-running while developing the scripts.
In case you rarely work with Excel, and because I have seen people not using its full potential, here is a quick summary of how to properly import a CSV file into Excel. Select Data from the main menu, click Get Data, opt for Text/CSV and select your CSV file. Excel automatically detects the encoding, hopefully you’re using UTF-8, and the column delimiter, tab, comma or colon or whatever was chosen when the files were created. Voilà! The result is a nice-looking table that you can share with business stakeholders.
So far I’ve just mentioned scripts, but what is the ideal language for that purpose? Well, it’s your choice, but keep in mind that migration is a highly iterative process, so you might want to use a scripting language that does not require compilation. Apart from that, use whatever you’re comfortable with. But there is one crucial part to consider. The blueprint requires that you remotely copy data on the cloud storage from one set of folders to another. So you might want to check that your scripting language has an SDK for that storage. You probably do not want to use bare HTTP REST calls for that purpose. I guess that rules out Perl, but Python or JavaScript work.
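To make the file-in, file-out idea more tangible, here is a minimal sketch of what a single mapper step could look like in a language without a compile step. Everything specific in it is an assumption for illustration: the file names, the semicolon-separated columns without quoted fields, the cutoff date and the folder that is always kept. The clean-up rule mirrors the kind of rule mentioned earlier (skip assets older than a given date unless they live in a certain folder), and the target path simply prefixes the AEM Assets root `/content/dam`.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Work in a temporary folder so the sketch is self-contained.
const workDir = fs.mkdtempSync(path.join(os.tmpdir(), "migration-"));
const inputFile = path.join(workDir, "01-assets-extracted.csv");
const outputFile = path.join(workDir, "02-assets-mapped.csv");

// Stand-in for the output of a previous extraction step (hypothetical data).
fs.writeFileSync(inputFile, [
  "legacyId;path;created",
  "A-1;/archive/old-logo.png;2009-05-01",
  "A-2;/brand/logo.png;2010-01-15",
  "A-3;/campaigns/hero.jpg;2023-06-15",
].join("\n") + "\n", "utf8");

// Business rules expressed directly in code (both values are hypothetical).
const CUTOFF = "2015-01-01";   // do not migrate assets created before this date ...
const ALWAYS_KEEP = "/brand/"; // ... except when they are in this folder

function shouldMigrate(asset) {
  // ISO dates (YYYY-MM-DD) compare correctly as plain strings.
  return asset.created >= CUTOFF || asset.path.startsWith(ALWAYS_KEEP);
}

// Read a simple CSV file into an array of objects keyed by the header names.
function readCsv(file) {
  const [headerLine, ...lines] = fs.readFileSync(file, "utf8").trim().split(/\r?\n/);
  const headers = headerLine.split(";");
  return lines.map((line) => {
    const cells = line.split(";");
    return Object.fromEntries(headers.map((h, i) => [h, cells[i]]));
  });
}

// Write rows to a new file; the input file is never modified.
function writeCsv(file, headers, rows) {
  const lines = rows.map((row) => headers.map((h) => row[h]).join(";"));
  fs.writeFileSync(file, [headers.join(";"), ...lines].join("\n") + "\n", "utf8");
}

const extracted = readCsv(inputFile);
const mapped = extracted.filter(shouldMigrate).map((a) => ({
  legacyId: a.legacyId,            // keep the legacy ID for later cross-referencing
  aemPath: "/content/dam" + a.path,
  created: a.created,
}));
writeCsv(outputFile, ["legacyId", "aemPath", "created"], mapped);
console.log(`${extracted.length} rows in, ${mapped.length} rows out -> ${outputFile}`);
```

Because the step only reads one file and writes another, it can be re-run as often as needed while the rules are being tuned, and both files can be opened in a spreadsheet for comparison.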
My personal tip, use Node.js. If you’re watching this session, you are probably an AEM backend developer. You might be used to Java or Python and feel more at home with them. But this is a good opportunity to refresh your JavaScript or upskill yourself towards full stack developer. Migration scripts are one-off code that you throw away at the end of the project. So it does not have to have clean-code production quality. It’s really a risk-free opportunity to learn a new language. And if you plan to stay in the Adobe cosmos, Adobe is heavily leaning on JavaScript these days. More and more parts for or in AEM are built with App Builder in JavaScript. Also, Adobe provides React Spectrum to support developing UIs in JavaScript. From what I have heard, it is easier to learn than the AEM Granite library. Edge Delivery Services heavily relies on browser-side JavaScript as well. An EDS project usually requires no or very little Java backend code. JavaScript has matured quite a bit in the last decade. And actually, I found it quite fun to work with. I find the only difficult part in JavaScript is when you start handling I/O asynchronously. But if you’re a starter, for most of the batch processing required here, you can use the synchronous versions of the APIs, which removes this hurdle.
My migration approach follows a couple of principles I have already touched upon. Let’s repeat them anyway, just so the whole approach makes more sense and does not look like a random set of choices. The first and foremost principle is quality. A couple of minutes ago, I said we don’t want to create production quality code for the migration. And that is still true. Only the migrated data needs to be of the best quality. I can’t stress this enough. When your rendering code in a Sites project is buggy, well, you fix the bug, redeploy, and then you’re good. But if there are flaws in the migrated data, you probably can’t repair those as easily.
If you detect them after a couple of weeks, well, you can’t just rerun the migration. To mitigate that, it is a good idea to keep the original data and all CSV files in an archive for some time, so that you could repair the data from that base. I also keep legacy asset IDs in the metadata so that assets can always be cross-referenced later, even after they have been moved, renamed, or altered. Another means to mitigate the risk of data rot in the migration is to QA intermediate results thoroughly and frequently. Thus the need for CSV files, which make the process quite transparent.
The whole process and each stage of the process must be repeatable. That is why each script reads data from one file and writes to another file. In particular, the scripts do not alter data in place. That is also why we copy the data on the cloud storage from one folder to another and do not move it. All operations are non-destructive. The scripts run offline on your local machine to speed up the process and allow for interactive development. Binaries are never downloaded to your local system for obvious reasons.
Use a scripting language, write quick and dirty code. Each migration is a one-off exercise. In particular, do not gold-plate the migration scripts. You throw them away anyway. Do not create frameworks with complex rule engines. Express the rules you require directly in the scripting language. Whatever rule engine you might come up with, it will never be complete. Plus, think about whether it is worth the investment. How many migration projects do you have per year? And even if your company has two or three, is it worth the effort to document your framework so that others can make use of it?
These are the tools I have used in my past projects. Node.js, as a learning opportunity and because it supports CSV, AWS and HTTP. VS Code for scripting and debugging and to eyeball the CSV files. By the way, VS Code is super fast when loading huge CSV files.
And there is an excellent extension called Rainbow CSV that can syntax highlight CSV files by column. For quantitative analysis and to quickly spot irregularities, I import the CSV into Excel. These files can then be shared with business users for QA. I import the CSV into Excel manually. Again, I still haven’t had the time to learn how to automate Excel file generation. I use an older version of ACS Commons on my local machine to transform CSV files into a tag structure. And for importing binaries and metadata, I use the built-in tools from AEM as a Cloud Service. I’m pretty sure you know about the aforementioned tools already, but maybe not about Rainbow CSV for VS Code. I strongly encourage you to give it a shot when working with CSV files. It makes reading so much easier. See for yourself in the comparison with and without syntax highlighting.
Earlier I mentioned that a table-based visualization of the data helps with scanning it and spotting irregularities. To prove my point, I ask you to scan the following screenshot. Don’t bother to read what’s in there. Let’s take three seconds. You see the issues? Yes, right. There are some issues, probably with the encoding.
Talking about encoding, one last tip on working with CSV files that I learned the hard way. When writing UTF-8 text files, and CSVs are text files, it is a good habit to write a few marker bytes at the top of the file. This is called the byte order mark or BOM. Don’t worry, you probably won’t ever have to write or explicitly read these yourself. But the tools and libraries you use might do so, or not. In my first project, it took me the better part of a day to figure out that some do and some don’t. That was causing quite some headache back then. When you know what to look for, it’s quite easy to spot. Editors and command line tools handle the BOM correctly, so you won’t spot it there. But a hex dump reveals whether it is there or not. Check your CSV library for BOM support and how it can be turned on or off.
That’s all I have on the migration implementation. Maybe a few tips and lessons learned on the migration execution. First, declare a content freeze during the migration to prevent users from uploading new assets after you have dumped the snapshot from legacy. How long the freeze must be depends on how much data you have and how smoothly the process works. A test migration can help estimate the timing. A big bang migration with a content freeze is the simplest option. If you need a parallel operation, well, plan for more time in the preparation and development. Also reserve plenty of time and test resources after the migration execution on production. You want to get an official business approval as soon as possible. You can’t lift the content freeze until you have that approval. So you want to keep this phase as compact as possible. If you have a large volume of data and a small content freeze window, consider doing an initial import of the binary data early in the project. In the actual live migration, you would only have to transfer the new assets as a top-up. Make sure though that all project members understand the difference between a top-up and a sync.
A top up usually only adds new assets and maybe changes to previously added assets. It does not track assets that have been removed from the original and it is difficult to keep track of assets if they have been moved on either system. If that is a requirement, extra coding might be required to do some cleanup after the migration. That’s all I have for now. Thank you for staying with me until the end. If you want to get more details, please find me on medium.com. You can follow me there for more articles on AEM. Bye bye.