DataWeave For Web Scraping

In an earlier post, I described how you could use DataWeave by MuleSoft to replace a reserved keyword in Salesforce.

Today, I’ll illustrate how you can use a DataWeave script to scrape a web page for data that you can upload to your Salesforce org.

Sometimes, a service or site does not provide an adequate API to extract data from it. In certain circumstances, running a simple data scrape is a good solution, especially if you are retrieving a list of records.

First, go to your website and run your query. In this instance, let’s assume we have a list of news articles. Grab the contents of the web page and save this as a string. Something like this:

String bodyText = 
'  <main>'+
'    <div class="search-results main-search">'+
'      <article>'+
'        <header>'+
'          <a href="https://www.someurl.com/"><h2>A Headline</h2></a>'+
'          <p class="date">December 13, 2023</p>'+
'        </header>'+
'        <p>Details of Article</p>'+
'      </article>'+
'      <article>'+
'        <header>'+
'          <a href="https://www.anotherurl.com/"><h2>Another Headline</h2></a>'+
'          <p class="date">January 12, 2024</p>'+
'        </header>'+
'        <p>Details of Another Article</p>'+
'      </article>'+
'    </div>'+
'  </main>';

This means you can now test without performing a callout.

Now isolate the part of the page you want:

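// Locate the <main> element; 7 is the length of '</main>'.
Integer start = bodyText.indexOf('<main>');
Integer finish = bodyText.indexOf('</main>') + 7;

// mid(startIndex, length) returns the block including both tags.
String partial = bodyText.mid(start, finish - start);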

Now you have a chunk of text that you can pass into your DataWeave script.
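The index arithmetic above assumes both tags are present. indexOf returns -1 when a tag is missing, so the slice would no longer be the block you expect. Here is a minimal sketch of a guarded version. HtmlSlice and between are hypothetical names introduced for illustration, not part of any Salesforce API:

```apex
public class HtmlSlice {
    // Returns the text from openTag through closeTag (inclusive),
    // or null when the source is null or either tag is missing.
    public static String between(String source, String openTag, String closeTag) {
        if (source == null) {
            return null;
        }
        Integer start = source.indexOf(openTag);
        if (start == -1) {
            return null;
        }
        // Look for the closing tag only after the opening tag.
        Integer closeAt = source.indexOf(closeTag, start + openTag.length());
        if (closeAt == -1) {
            return null;
        }
        // substring(startIndex, endIndex) excludes endIndex, so add the tag length.
        return source.substring(start, closeAt + closeTag.length());
    }
}
```

With this in place, HtmlSlice.between(bodyText, '<main>', '</main>') yields the same chunk as the start/finish calculation, and a null result tells you the page layout has changed.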

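From here, you can design your DataWeave script to extract the data. The script reads its input as application/xml, so the chunk you pass in must be well-formed XML; HTML with unclosed tags will not parse. In this case, the script looks like this: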

%dw 2.0
input incomingHtml application/xml
output application/json duplicateKeyAsArray=true, writeAttributes=true

// Strip tab and newline characters from a text value
fun replaceStr(val) = (val replace ('\t') with('')) replace '\n' with ('')
---
// Emit one object per <article> element under <main><div>
incomingHtml.main.div.*article map (record) -> {
   title: replaceStr(record.header.a.h2 as String),
   href: record.header.a.@href,
   date: replaceStr(record.header.p),
   body: replaceStr(record.p)
}

This pulls each article out and maps the contents into a title, href, date, and body.

The replaceStr function strips tab and newline characters from each value.

The href field is interesting because it shows that you can pull attributes out of the HTML string: the .@href selector reads the href attribute of the <a> element. The script also sets the writeAttributes flag on the output, which tells the JSON writer to keep attributes on the elements it writes rather than dropping them.

From here, you can invoke the script like this:

DataWeave.Script script = new DataWeaveScriptResource.parseSearchHtml();
DataWeave.Result result = script.execute(new Map<String, Object>{ 'incomingHtml' => partial });
String output = result.getValueAsString();

This returns a JSON array with one object per article, of the form:

[{
  "title": "title",
  "href": "href",
  "date": "date",
  "body": "body"
}]

Now, you can render this any way you like!
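To work with the result in Apex, you need to parse the JSON string first. Here is a minimal sketch using JSON.deserializeUntyped. ArticleJson and parse are hypothetical names, and the untyped route is chosen on the assumption that a typed wrapper class would be awkward here, because date is on Apex's reserved keyword list:

```apex
public class ArticleJson {
    // Parses the script's JSON array into a list of key/value maps,
    // one map per article.
    public static List<Map<String, Object>> parse(String jsonText) {
        List<Map<String, Object>> articles = new List<Map<String, Object>>();
        for (Object item : (List<Object>) JSON.deserializeUntyped(jsonText)) {
            articles.add((Map<String, Object>) item);
        }
        return articles;
    }
}
```

Each map can then be read with calls such as (String) article.get('title') and mapped onto whatever records you plan to create.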

The only thing you need to do now is replace your static text with the body of the HTTP response:

HttpResponse res = http.send(req);
String bodyText = res.getBody();

You are done! Don’t forget to add some error checking here, of course. From here, you can upload this data into your Salesforce org.
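As a rough illustration of that error checking, here is a minimal sketch of the callout. ArticleFetcher, FetchException, and fetchBody are hypothetical names, the 20-second timeout is an arbitrary choice, and the endpoint is assumed to be allowed through a Remote Site Setting or Named Credential:

```apex
public class ArticleFetcher {
    // Custom exception names in Apex must end in "Exception".
    public class FetchException extends Exception {}

    // Performs a GET callout and returns the response body,
    // throwing FetchException on any failure.
    public static String fetchBody(String endpoint) {
        HttpRequest req = new HttpRequest();
        req.setEndpoint(endpoint);
        req.setMethod('GET');
        req.setTimeout(20000); // milliseconds

        HttpResponse res;
        try {
            res = new Http().send(req);
        } catch (CalloutException e) {
            throw new FetchException('Callout failed: ' + e.getMessage());
        }
        if (res.getStatusCode() != 200) {
            throw new FetchException('Unexpected status: ' + res.getStatusCode());
        }
        String body = res.getBody();
        if (String.isBlank(body)) {
            throw new FetchException('Empty response body');
        }
        return body;
    }
}
```

Calling ArticleFetcher.fetchBody with your page URL then takes the place of the static bodyText string, and the rest of the steps stay the same.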

Further Customizing Your Salesforce Capabilities

This is just one example of how our team of Salesforce experts helps organizations achieve more with their Salesforce org. If you need insights on how to do more on the platform, our team of experienced architects can help. Contact us to talk with a consultant today.
