Parsing e-mail for info

[UPDATE] Added a version that works on OS X 10.4 to the bottom of this post. It does some different logging and is tighter, so I’ll have to go back to the original and update it again.

This was done earlier, but I have since had to change it to match changes in the incoming data, some of which is a big mystery. In particular, there is a newline that arrives as the Unicode “line separator” character (U+2028, character id 8232). It would end up in the clipboard as a line feed, so I just started plugging in whitespace character ids until one hit the mark. Lucky me.
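If you would rather not guess, dumping the bytes in the shell is one way to pin down a character like that. This is only a sketch and not part of the script: it assumes the text is UTF-8 (where U+2028 is the three bytes e2 80 a8), and the function names `show_bytes` and `ls_to_lf` are made up for illustration.

```shell
# Hypothetical helpers, not part of the AppleScript in this post.

# Print the bytes of the argument as one continuous hex string.
show_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

# Replace every UTF-8 encoded U+2028 (line separator) with a line feed.
# LC_ALL=C makes sed treat the input as plain bytes.
ls_to_lf() {
  ls_char=$(printf '\342\200\250')
  printf '%s\n' "$1" | LC_ALL=C sed "s/$ls_char/\\
/g"
}

sample="123 MAIN ST$(printf '\342\200\250')SPRINGFIELD, ILLINOIS 62701"

show_bytes "$(printf 'A\342\200\250B')"   # prints 41e280a842
ls_to_lf "$sample"                        # prints the address on two lines
```

The `e280a8` in the middle of the hex dump is the line separator, which is the same character as `character id 8232` in AppleScript.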

With that in mind, this is tuned for my specific data, but should be readily taken apart and tweaked to work for others. It’s mostly simple text item delimiter filtering, which is only done a few times.

Things to note:

There is a first and/or last “character” removal subroutine (trim_char) that really works on text items, and it has no mechanism to check what is being left out of the returned string.

There is a LOT of logging peppered in the main part of the script. Hopefully it (and how to turn it off) is easy to work with. It’s really the only way to nail down what you’re getting and where you’re getting it.

Last but certainly not least, and the star of our show, there is a (US) state name abbreviation subroutine. It could easily be run in reverse or even set up with a switch. BTW, you’re welcome. 😀
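To show what “run in reverse” or “set up with a switch” might look like, here is a rough shell sketch. It is not the subroutine from the script: the names `state_table` and `state_lookup` are hypothetical, the mode words "abbr" and "full" are my own invention, and the table is cut down to four states to keep it short.

```shell
# Hypothetical lookup table: full name|abbreviation (only four states here).
state_table() {
  cat <<'EOF'
CALIFORNIA|CA
NEW YORK|NY
TEXAS|TX
WEST VIRGINIA|WV
EOF
}

# $1 is the switch: "abbr" (name to abbreviation) or "full" (the reverse).
# $2 is the value to look up. Anything not in the table comes back unchanged.
state_lookup() {
  state_table | awk -F'|' -v mode="$1" -v q="$2" '
    mode == "abbr" && $1 == q { print $2; found = 1; exit }
    mode == "full" && $2 == q { print $1; found = 1; exit }
    END { if (!found) print q }'
}

state_lookup abbr "NEW YORK"   # prints NY
state_lookup full "TX"         # prints TEXAS
state_lookup abbr "NEW YROK"   # prints NEW YROK (misspelled, sent back as is)
```

Unlike the AppleScript comparison, awk's `==` is case sensitive, so this sketch expects upper-case input.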

PS: You may notice there is a second text abbreviation function. The first one, for state names, is pure AppleScript and uses the input as a single object to find the position in one list, which we then use to pull from the other. The second one, for street address abbreviations, uses sed and examines whole lines against the entire list of patterns (with another list holding the matching replacements). You could do the state names with the sed routine, but it seems to me it would be slower, since you end up looking for things guaranteed not to be in the line you’re feeding it.
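The single-lookup idea behind the state names routine can be stretched to the street line too. The sketch below is a different technique from the sed routine in the script: it uses awk to compare each space-separated word against a lookup table, in one process, so no pattern is ever run that cannot match. The function name `address_abbr_words` is hypothetical and the table holds only a few of the abbreviations.

```shell
# Hypothetical helper: abbreviate whole words only, one awk process per line.
address_abbr_words() {
  printf '%s\n' "$1" | awk '
    BEGIN {
      # Cut-down lookup table; the script in this post has the longer list.
      m["NORTH"] = "N"; m["SOUTH"] = "S"; m["ROAD"] = "RD"
      m["STREET"] = "ST"; m["AVENUE"] = "AVE"
    }
    {
      # Replace a field only when the entire word is a key in the table.
      for (i = 1; i <= NF; i++) if ($i in m) $i = m[$i]
      print
    }'
}

address_abbr_words "123 NORTH BROADWAY ROAD"   # prints 123 N BROADWAY RD
address_abbr_words "45 NORTHERN AVENUE"        # prints 45 NORTHERN AVE
```

Two trade-offs worth knowing: awk rebuilds the line with single spaces between words, and a word with punctuation stuck to it (say "ROAD,") will not match.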

Although I could do two sed-based routines, one for each line of the address, each with its own focused list. I wouldn’t have to extract the state name doing it that way. Just throw the whole line at it… Hmmm, to be continued?
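As a thought experiment, throwing the whole city line at sed might look something like this. It is a sketch under assumptions, not something from the script: the line is assumed to look like "CITY, STATE ZIP", the function name `abbr_city_line` is hypothetical, and only three states are listed.

```shell
# Hypothetical helper: abbreviate the state in a "CITY, STATE ZIP" line
# without extracting it first. Each pattern is anchored by the ", " before
# the state and the first digit of the ZIP after it.
abbr_city_line() {
  printf '%s\n' "$1" | sed \
    -e 's/, WEST VIRGINIA \([0-9]\)/, WV \1/' \
    -e 's/, VIRGINIA \([0-9]\)/, VA \1/' \
    -e 's/, NEW YORK \([0-9]\)/, NY \1/'
}

abbr_city_line "CHARLESTON, WEST VIRGINIA 25301"   # prints CHARLESTON, WV 25301
abbr_city_line "NEW YORK, NEW YORK 10001"          # prints NEW YORK, NY 10001
```

The anchors do the job the word boundaries do in the street routine: the city "NEW YORK" is left alone because no digit follows it, and ", VIRGINIA" cannot match inside ", WEST VIRGINIA".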

-- http://strawhousepig.net/

-- Be aware that upper_cased() also removes every "." from text that is sent to it.

set _search to "Here is the knowledge you seek"
-- String that selected e-mails must contain. Can also be set to "is" instead of "contains" below.
set debug to true
-- Once you get things working set to 'false'.

set theAddresses to ""
set theCount to 0
set errors to ""

tell application "Mail"
  set _messages to selection as list
  repeat with m in _messages
    try
      if subject of m contains _search then
        -- Can also look at other message properties for the _search string.
        set _content to content of m as rich text
        set my text item delimiters to {"Name", "Email", "Address", "United States"}
        if debug is true then
          -- Find which text items contain the data we are after. 
          -- This time we want to label everything because it may be a lot of items, including whitespace items.
          -- Counting text items after we refine the data here should be a lot less tricky. You can of course do it like this again, too.
          repeat with t from 1 to count of text items of _content
            try
              log "text item " & t & " of _content = " & item t of every text item of _content
            end try
          end repeat
        end if
        set _name to text item 2 of _content
        set _address to text item 4 of _content
        set my text item delimiters to {character id 10, character id 13, character id 8232}
        -- 10 = line feed, 13 = carriage return, 8232 = line separator
        if debug is true then
          log "_name :" & text items of _name
          log "_address :" & text items of _address
        end if
        set _name to my upper_cased(text item 3 of _name)
        set _address1 to my upper_cased(text item 3 of _address)
        set _address2 to my upper_cased(text item 4 of _address)
        if debug is true then
          log "_name: " & text items of _name
          log "_address1: " & text items of _address1
          log "_address2: " & text items of _address2
        end if
        set my text item delimiters to {",", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}
        set _city to text item 1 of _address2
        set _state to text item 2 of _address2 -- This leaves us with a space before and after the text, which we remove later.
        set my text item delimiters to {space}
        set _zip to last text item of _address2
        if debug is true then
          log "_city: " & _city
          log "_state: " & _state
          log "_zip: " & _zip
        end if
        set _state to my state_abr(my trim_char(_state, "both"))
        if debug is true then
          log "_state: " & _state
        end if
        set theAddresses to theAddresses & _name & return & my address_abr(_address1) & return & _city & ", " & _state & " " & _zip & return & return
        set theCount to theCount + 1
      else
        set errors to errors & "Selected e-mail subject does not contain what we are looking for." & return & return
      end if
    on error theErr
      set errors to errors & theErr & return & return
    end try
  end repeat
  if theCount is not 0 then
    set the clipboard to theAddresses
    display dialog (theCount as string) & " addresses ready to paste." with icon 1
  else
    display dialog "No addresses were extracted from selected e-mails. :(" with icon 2
  end if
  if errors is not "" then display dialog "Some funny business was reported:" & return & return & errors
end tell

on trim_char(the_input, first_last_both)
  set my text item delimiters to {}
  set trim_count to count of text items of the_input
  if first_last_both is "first" then return text items 2 thru trim_count of the_input as string
  if first_last_both is "last" then return text items 1 thru (trim_count - 1) of the_input as string
  if first_last_both is "both" then return text items 2 thru (trim_count - 1) of the_input as string
end trim_char

on upper_cased(_input)
  return (do shell script "echo " & quoted form of _input & " | sed 's/\\.//g' | tr a-z A-Z")
end upper_cased

on state_abr(_state)
  -- States don't have to be listed in all caps as this is case insensitive, but that's how it's sent to this function.
  set long_state to {"ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA", "COLORADO", "CONNECTICUT", "DELAWARE", "FLORIDA", "GEORGIA", "HAWAII", "IDAHO", "ILLINOIS", "INDIANA", "IOWA", "KANSAS", "KENTUCKY", "LOUISIANA", "MAINE", "MARYLAND", "MASSACHUSETTS", "MICHIGAN", "MINNESOTA", "MISSISSIPPI", "MISSOURI", "MONTANA", "NEBRASKA", "NEVADA", "NEW HAMPSHIRE", "NEW JERSEY", "NEW MEXICO", "NEW YORK", "NORTH CAROLINA", "NORTH DAKOTA", "OHIO", "OKLAHOMA", "OREGON", "PENNSYLVANIA", "RHODE ISLAND", "SOUTH CAROLINA", "SOUTH DAKOTA", "TENNESSEE", "TEXAS", "UTAH", "VERMONT", "VIRGINIA", "WASHINGTON", "WEST VIRGINIA", "WISCONSIN", "WYOMING"}
  set short_state to {"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"}
  repeat with n from 1 to count of long_state
    if item n of long_state is _state then return item n of short_state
  end repeat
  -- If someone doesn't know how to spell we need to send the original back.
  return _state
end state_abr

on address_abr(raw_address)
  -- WE ARE... We are changing case to upper before sending to this function to avoid style differences. Why?
  -- Because `sed` on OS X cannot be used case insensitive (same for NetBSD apparently, and is POSIX compliant?).
  set long_address to {"NORTH", "SOUTH", "EAST", "WEST", "NORTHEAST", "NORTHWEST", "SOUTHEAST", "SOUTHWEST", "ROAD", "STREET", "AVENUE", "BOULEVARD", "PLACE", "CIRCLE", "DRIVE", "LANE", "ROUTE", "SUITE", "APARTMENT", "PARKWAY", "HIGHWAY", "EXPRESSWAY", "BYPASS", "CAUSEWAY", "STRAVENUE", "THROUGHWAY", "TURNPIKE", "VIADUCT"}
  set short_address to {"N", "S", "E", "W", "NE", "NW", "SE", "SW", "RD", "ST", "AVE", "BLVD", "PL", "CIR", "DR", "LN", "RTE", "STE", "APT", "PKWY", "HWY", "EXPY", "BYP", "CSWY", "STRA", "TRWY", "TPKE", "VIA"}
  repeat with s from 1 to count of long_address
    set raw_address to (do shell script "echo " & quoted form of raw_address & " | sed 's/[[:<:]]" & item s of long_address & "[[:>:]]/" & item s of short_address & "/g'")
    -- Taking apart the `sed` command: s=substitute; /=regex delimiter; [[:<:]] & [[:>:]]=word boundaries around PATTERN, so we don't have to worry
    -- about "NORTHERN" or "BROADWAY"; g=all occurrences of PATTERN. Try it out!
    -- echo BROADWAY ROAD | sed 's/[[:<:]]ROAD[[:>:]]/RD/g'
  end repeat
  return raw_address
end address_abr

The version below works on OS X 10.4 (AppleScript version 1.10) and has not been tested beyond that.

-- http://strawhousepig.net/

set _search to "Here is the knowledge you seek"
-- String that selected e-mails must contain. Can also be set to "is" instead of "contains" below.
set debug to true
-- If things don't work out as planned, set to true and look at the Replies pane below.

set theAddresses to ""
set theCount to 0
set errors to ""

tell application "Mail"
  set _messages to selection as list
  repeat with m in _messages
    try
      if subject of m contains _search then
        -- Can also look at other message properties for the _search string.
        set _content to content of m
        if debug is true then
          -- Find which lines (paragraphs) contain the data we are after. 
          repeat with p from 1 to count of paragraphs of _content
            try
              log "paragraph " & p & " = " & item p of every paragraph of _content
            end try
          end repeat
        end if
        set _name to my upper_cased(paragraph 7 of _content)
        -- paragraph 13 contains the next set of data I need, however it also contains a Unicode "line separator" character.
        set _content2 to (do shell script "echo " & quoted form of (paragraph 13 of _content) & " | grep \\n")
        -- This susses out my oddball "line separator" character without using AppleScript text item delimiters because,
        -- AppleScript versions before 2.0 (OS X 10.5) were not, or not readily, able to work with Unicode characters.
        -- We (as in I) are forced to do this since we (as in I) still run OS X 10.4 at times.
        if debug is true then
          repeat with p2 from 1 to count of paragraphs of _content2
            try
              log "paragraph2 " & p2 & " = " & paragraph p2 of _content2
            end try
          end repeat
        end if
        set _address1 to my upper_cased(paragraph 1 of _content2)
        set _address2 to my upper_cased(paragraph 2 of _content2)
        if debug is true then
          log "_name: " & _name
          log "_address1: " & _address1
          log "_address2: " & _address2
        end if
        set my text item delimiters to {","}
        -- Not sure if it's just me, but on 10.4 I can't use more than the first delimiter in a list. FML...
        -- Isolate city from state and numbers at the end. Hopefully there is always a comma after the city.
        set _city to text item 1 of _address2
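        set _state_zip to text item 2 of _address2
        set my text item delimiters to {space}
        -- Rejoin the state words with spaces so "NEW YORK" reaches state_abr() as one string, not a list of words.
        set _state to my state_abr((words 1 thru ((count of words of _state_zip) - 1) of _state_zip) as string)
        set _zip to last word of _state_zip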
        if debug is true then
          log "_city: " & _city
          log "_state: " & _state
          log "_zip: " & _zip
        end if
        set theAddresses to theAddresses & _name & return & my address_abr(_address1) & return & _city & ", " & _state & " " & _zip & return & return
        set theCount to theCount + 1
      else
        set errors to errors & "Selected e-mail subject does not contain \"" & _search & "\"" & return & return
      end if
    on error theErr
      set errors to errors & theErr & return & return
    end try
  end repeat
  if theCount is not 0 then
    set the clipboard to theAddresses
    display dialog (theCount as string) & " addresses ready to paste." with icon 1
  else
    display dialog "No addresses were extracted from selected e-mails. :(" with icon 2
  end if
  if errors is not "" then display dialog "Some funny business was reported:" & return & return & errors
end tell

on upper_cased(_input)
  return (do shell script "echo " & quoted form of _input & " | tr a-z A-Z")
end upper_cased

on state_abr(_state)
  set long_state to {"ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA", "COLORADO", "CONNECTICUT", "DELAWARE", "FLORIDA", "GEORGIA", "HAWAII", "IDAHO", "ILLINOIS", "INDIANA", "IOWA", "KANSAS", "KENTUCKY", "LOUISIANA", "MAINE", "MARYLAND", "MASSACHUSETTS", "MICHIGAN", "MINNESOTA", "MISSISSIPPI", "MISSOURI", "MONTANA", "NEBRASKA", "NEVADA", "NEW HAMPSHIRE", "NEW JERSEY", "NEW MEXICO", "NEW YORK", "NORTH CAROLINA", "NORTH DAKOTA", "OHIO", "OKLAHOMA", "OREGON", "PENNSYLVANIA", "RHODE ISLAND", "SOUTH CAROLINA", "SOUTH DAKOTA", "TENNESSEE", "TEXAS", "UTAH", "VERMONT", "VIRGINIA", "WASHINGTON", "WEST VIRGINIA", "WISCONSIN", "WYOMING"}
  set short_state to {"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"}
  repeat with n from 1 to count of long_state
    if item n of long_state is _state then return item n of short_state
  end repeat
  -- If someone doesn't know how to spell we need to send the original back.
  return _state
end state_abr

on address_abr(raw_address) -- This is some hot garbage right here.
  -- WE ARE... We are changing case to upper before sending to this function to avoid style differences. Why?
  -- Because `sed` on OS X cannot be used case insensitive (same for NetBSD apparently, and is POSIX compliant?).
  set long_address to {"NORTH", "SOUTH", "EAST", "WEST", "NORTHEAST", "NORTHWEST", "SOUTHEAST", "SOUTHWEST", "ROAD", "STREET", "AVENUE", "BOULEVARD", "PLACE", "CIRCLE", "DRIVE", "LANE", "ROUTE", "SUITE", "APARTMENT", "PARKWAY", "HIGHWAY", "EXPRESSWAY", "BYPASS", "CAUSEWAY", "STRAVENUE", "THROUGHWAY", "TURNPIKE", "VIADUCT"}
  set short_address to {"N", "S", "E", "W", "NE", "NW", "SE", "SW", "RD", "ST", "AVE", "BLVD", "PL", "CIR", "DR", "LN", "RTE", "STE", "APT", "PKWY", "HWY", "EXPY", "BYP", "CSWY", "STRA", "TRWY", "TPKE", "VIA"}
  repeat with s from 1 to count of long_address
    set raw_address to (do shell script "echo " & quoted form of raw_address & " | sed 's/[[:<:]]" & item s of long_address & "[[:>:]]/" & item s of short_address & "/g'")
    -- Taking apart the `sed` command: s=substitute; /=regex delimiter; [[:<:]] & [[:>:]]=PATTERN delineation (so we don't have to worry
    -- about "NORTHERN" or "BROADWAY", it only matches whole words); g=all occurrences of PATTERN
    -- sed 's/[[:<:]]PATTERN[[:>:]]/REPLACEMENT/g'
  end repeat
  return raw_address
end address_abr
