REGEX for HTML

niveeklive · May 12, 2024, 8:33pm

Hello! This REGEX formula works perfectly:

$tc(reg, "SUMMARY: Desired_text_here DTSTART", "SUMMARY:\s(.*?)\sDTSTART", "$1")$

And I get “Desired text here” as output. But when I combine the regex function with the WebGet function, I don’t get the desired text. Maybe it’s the HTML formatting, but I can’t figure out how to fix it. Here is the whole formula with the URL:

$tc(reg, wg("https://tasks.office.com/5b7921be-1ca9-4db8-9d2c-4de4071b1eca/Calendar/User/yzhMWHc7pUeAB40cBlTGDGUAKGdA?t=0_fe113c6a-43de-46bb-9d1a-c74fa5eb25c3_2024-05-05T18%3A25%3A56.9825830%2B00%3A00", txt), "(?s)SUMMARY:\s*(.*?)\s*DTSTART", "$1")$

It outputs the whole HTML converted to a string, minus the “SUMMARY:” and “DTSTART”, instead of just the text between the two.

Could someone please help me correct it?
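
A possible explanation, sketched in Python rather than Kustom: if tc(reg, ...) works as a search-and-replace (an assumption, though it fits the output described above), then only the matched span is swapped for $1 and everything around it is kept. That would also explain why the first test worked, since there the pattern covers the entire input. The sample text below is made up for illustration, and Python's `re` module is only a stand-in for Kustom's regex engine.

```python
import re

# Hypothetical input: the markers sit in the middle of a longer text.
TEXT = "BEGIN SUMMARY: Task title DTSTART 20240512 END"

# Replacing only the matched span keeps all the text around it.
partial = re.sub(r"SUMMARY:\s*(.*?)\s*DTSTART", r"\1", TEXT, flags=re.DOTALL)

# Letting the pattern consume the whole input leaves only the captured group.
whole = re.sub(r".*?SUMMARY:\s*(.*?)\s*DTSTART.*", r"\1", TEXT, count=1, flags=re.DOTALL)

print(partial)  # BEGIN Task title 20240512 END
print(whole)    # Task title
```

If Kustom behaves the same way, adding `.*?` before SUMMARY and `.*` after DTSTART (together with `(?s)`) might be worth trying, but that is untested here.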

Ace · May 13, 2024, 9:41pm

That link seems to download an .ics file, and the content I’m getting (shown below) doesn’t look like your working test string:

SUMMARY:A validar
com SEGES na próxi
ma reunião
DTSTART;VALUE=DATE

niveeklive · May 13, 2024, 10:06pm

Exactly, I need “A validar com SEGES na próxima reunião”, which is the title of the task from MS Planner. It sits between SUMMARY and DTSTART. I’m not sure whether it’s those line-break characters that Kustom’s REGEX doesn’t handle…

With this formula, I get that output:

$tc(reg, (wg("https://tasks.office.com/5b7921be-1ca9-4db8-9d2c-4de4071b1eca/Calendar/User/yzhMWHc7pUeAB40cBlTGDGUAKGdA?t=0_fe113c6a-43de-46bb-9d1a-c74fa5eb25c3_2024-05-05T18%3A25%3A56.9825830%2B00%3A00", txt)), "SUMMARY:\s(.*?)\sDTSTART", "$1")$
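
On the line breaks: iCalendar files fold long lines by inserting a line break followed by a single space or tab, which is why the title arrives in pieces. Below is a minimal Python sketch of unfolding the text before extracting the title. The sample string is hypothetical, modelled on the excerpt Ace posted (the date value is invented and the real file may fold differently), and Python's `re` is only a stand-in for whatever engine Kustom uses.

```python
import re

# Hypothetical sample; each continuation line starts with one fold character.
SAMPLE = (
    "BEGIN:VEVENT\r\n"
    "SUMMARY:A validar\r\n"
    "  com SEGES na próxi\r\n"   # fold space + a real space
    " ma reunião\r\n"            # fold space only (word split mid-way)
    "DTSTART;VALUE=DATE:20240513\r\n"
    "END:VEVENT\r\n"
)

def unfold(ics_text: str) -> str:
    """Join folded lines: drop each line break that is followed by a space or tab."""
    return re.sub(r"\r?\n[ \t]", "", ics_text)

def first_summary(ics_text: str):
    """Return the text between 'SUMMARY:' and the next 'DTSTART', or None."""
    match = re.search(r"SUMMARY:\s*(.*?)\s*DTSTART", unfold(ics_text), re.DOTALL)
    return match.group(1) if match else None

print(first_summary(SAMPLE))  # A validar com SEGES na próxima reunião
```

Here `re.search` returns only the captured group, so no surrounding text leaks into the result.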

Ace · May 15, 2024, 12:25pm

This formula doesn’t seem to work for me at all. May I know which version of the Kustom app you are currently running?

niveeklive · May 15, 2024, 3:41pm

Try this one. It removes the search words (SUMMARY and DTSTART) from the output. My KWGT version is 3.75b410013.

$tc(
     reg, 
      
     wg("https://tasks.office.com/5b7921be-1ca9-4db8-9d2c-4de4071b1eca/Calendar/User/yzhMWHc7pUeAB40cBlTGDGUAKGdA?t=0_fe113c6a-43de-46bb-9d1a-c74fa5eb25c3_2024-05-05T18%3a25%3a56.9825830%2b00%3a00", txt), 

      "(?s)SUMMARY:\s*(.*?)\s*DTSTART", 
      "$1")$

Ace · May 17, 2024, 3:55pm

Can you set the wg type to raw and match that pattern instead?

niveeklive · May 23, 2024, 1:55am

Just changing from txt to raw? That did not solve it. Or is there anything here that I’m missing?

frank · May 23, 2024, 1:13pm

Regexp on multi-line text is hard. I haven’t tried this myself, but you could look at flows: with flows you can split the work into multiple steps, so you could first use a regexp to find the right starting position and THEN split the text with another function to get the title.

Ace · May 24, 2024, 6:42pm

This seems to work for me, provided there is a fixed number of items that you need to extract from that file. For simplicity, I placed the webget into a global variable.

$tc(split, tc(split, gv(content), "SUMMARY", 1), "DTSTART", 0)$
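
For readers who want to see the logic outside Kustom, the nested split can be sketched in Python as below. The helper name `between` and the sample content are hypothetical, and the sketch assumes tc(split, text, separator, index) returns the piece at the given zero-based index.

```python
def between(text: str, start: str, end: str) -> str:
    """Return what lies after the first `start` and before the following `end`.

    Raises IndexError if `start` does not occur in `text`.
    """
    after_start = text.split(start)[1]  # index 1: piece after the first marker
    return after_start.split(end)[0]    # index 0: piece before the end marker

# Hypothetical content standing in for the downloaded file.
content = "BEGIN:VEVENT\nSUMMARY:Task title\nDTSTART;VALUE=DATE:20240513\nEND:VEVENT"

print(repr(between(content, "SUMMARY", "DTSTART")))  # ':Task title\n'
```

Because the separator is "SUMMARY" without the colon, the result still starts with ":" and ends with the line break, so some trimming may be needed afterwards.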

niveeklive · May 24, 2024, 9:22pm

Awesome! I tried using wg(link, txt) as a global value and it didn’t work. But putting it inside the formula works!

Split is better than REGEX in this case. Brilliant.

Thank you.

niveeklive · May 24, 2024, 9:37pm

If it isn’t asking too much, how did you download the file from the link that I shared and see its content? In Chrome, when I open the link, I only get an empty page. Saving it with CTRL + S as HTML or TXT doesn’t give me anything either.

EDIT: Forget it. Opening the link in Opera gives me the .ics file.

niveeklive · May 24, 2024, 10:08pm

One last try: this file has some non-breaking spaces. How do I get rid of them?

$tc(reg, "text with non-bre akable spa ces", " ", "")$

I tried this, with no success.

How does Kustom read these non-breaking spaces?

Edit: first, sorry for the many edits. Trying something on a Friday night isn’t a good idea. Second, I’m writing the solution here in case it helps someone one day.

I used tc(URL, "strange text") to reveal the non-breaking spaces, and then replaced them with empty text.
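
The idea behind that trick can be sketched in Python: URL-encoding makes an otherwise invisible non-breaking space (U+00A0) show up as a visible escape sequence, after which it can be targeted directly. The sample text is made up, and whether Kustom's tc(URL, ...) produces exactly the same encoding as Python's `quote` is an assumption.

```python
from urllib.parse import quote

NBSP = "\u00a0"  # non-breaking space, which looks like a normal space on screen
text = f"non-bre{NBSP}akable spa{NBSP}ces"

# In UTF-8 the non-breaking space is the bytes C2 A0, so it encodes as %C2%A0,
# while an ordinary space would encode as %20.
encoded = quote(text)

# Once identified, the character can be removed directly.
cleaned = text.replace(NBSP, "")

print(encoded)  # non-bre%C2%A0akable%20spa%C2%A0ces
print(cleaned)  # non-breakable spaces
```

A plain " " in a pattern does not match U+00A0, which would explain why the earlier attempt had no effect.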

system · June 18, 2024, 10:09pm

This topic was automatically closed 25 days after the last reply. New replies are no longer allowed.