I am reading from Collibra using the Collibra reader, and some of the data has HTML tags like <p> and <div> in it. Is there a way to filter out the HTML tags when they are present in the data, i.e. strip them out?
Let me know if more information is needed.
Thanks.
Best answer by Samuel Muvdi
Hi!
For removing HTML tags, I would recommend using either a transliterate step or a regex matching step.
Using the transliterate step you can do something like this:
And then when you test this out, you should see that the <p> and </p> tags have been removed from the cio_data column.
The other way you can achieve this (I think the faster method) would be to use the regex matching step, like so:
When testing this out with the following pattern, we can see that it gives us just the data between the tags, without the tags themselves:
<([a-zA-Z0-9_]+)>([\S\s]*)</([a-zA-Z0-9_]+)>
We can then see that $2 (the second capture group) holds our data without the tags :)
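To try the capture-group idea outside the tool, here is a minimal Python sketch. It is only a stand-in for the regex matching step, not the step itself. The pattern is modeled on the one above, written with an explicit `/` in the closing tag; the function names `inner_text` and `strip_tags` and the broader `ANY_TAG` pattern are my own assumptions for illustration.

```python
import re

# Modeled on the pattern in this answer: opening tag, content, closing tag.
# Group 2 is the content between the outermost pair of tags.
WRAPPED = re.compile(r"<([a-zA-Z0-9_]+)>([\S\s]*)</([a-zA-Z0-9_]+)>")

# Hypothetical broader pattern (not from the answer): any single opening or
# closing tag, with or without attributes.
ANY_TAG = re.compile(r"</?[a-zA-Z][^>]*>")


def inner_text(value: str) -> str:
    """Return group 2 when the whole value is wrapped in one tag pair,
    otherwise return the value unchanged."""
    match = WRAPPED.fullmatch(value.strip())
    return match.group(2) if match else value


def strip_tags(value: str) -> str:
    """Remove every tag-like token, keeping the text between them."""
    return ANY_TAG.sub("", value)


if __name__ == "__main__":
    print(inner_text("<p>hello</p>"))
    print(strip_tags('<div><p class="x">hello</p> world</div>'))
```

Note that the capture-group approach only unwraps the outermost tag pair, so nested tags stay inside group 2. If the data can contain several or nested tags, replacing every tag with an empty string (as in `strip_tags`) is likely the safer choice.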
Hi @Karthikeyan, I'm closing this thread for now. If you have any follow-up questions, please feel free to share them in the comments or create a new post.