Parsing e-mail for info

[UPDATE] Added a version that works on OS X 10.4 to the bottom of this post. It does some different logging and is tighter, so I’ll have to go back to the original and update it again.

This was done earlier, but I have since had to change it to match changes in the incoming data, some of which are a big mystery. In particular, a newline now arrives as the Unicode “line separator” character (U+2028, character id 8232). It would end up in the clipboard as a line feed, so I just started plugging in whitespace character IDs until one hit the mark. Lucky me.
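As a minimal sketch of that idea (not the script itself), the snippet below splits a made-up string on all three line-ending characters at once. The sample text and variable names are hypothetical, and it assumes an AppleScript version that accepts `character id` and a list of delimiters, which the 10.4 version at the bottom of this post does not.

```applescript
-- Hypothetical sample text: two lines joined by a Unicode line separator.
set sampleText to "123 Main Street" & (character id 8232) & "Springfield, Illinois 62701"

-- Save the current delimiters so they can be restored afterwards.
set savedDelimiters to AppleScript's text item delimiters

-- 10 = line feed, 13 = carriage return, 8232 = line separator (U+2028)
set AppleScript's text item delimiters to {character id 10, character id 13, character id 8232}
set theLines to text items of sampleText

set AppleScript's text item delimiters to savedDelimiters

-- theLines should now hold the street line and the city line as separate items.
return theLines
```

If the list comes back with a single item, the separator in your data is some other character, and it is back to plugging in IDs.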

With that in mind, this is tuned for my specific data, but it should be easy to take apart and tweak to work for others. It’s mostly simple text item delimiter filtering, which is only done a few times.

Things to note:

There is a first and/or last “character” removal subroutine that really works on text items, and it has no mechanism to check what is being dropped from the returned string.
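If the blind removal worries you, a guarded variant could look something like the sketch below. The handler name `trim_spaces` is hypothetical and it is not part of the script in this post; it only removes a character when that character really is a space.

```applescript
-- Hypothetical guarded alternative to trim_char: only strips leading and trailing spaces.
on trim_spaces(the_input)
	-- Drop leading spaces while more than one character remains.
	repeat while (the_input begins with space) and ((length of the_input) > 1)
		set the_input to text 2 thru -1 of the_input
	end repeat
	-- Drop trailing spaces while more than one character remains.
	repeat while (the_input ends with space) and ((length of the_input) > 1)
		set the_input to text 1 thru -2 of the_input
	end repeat
	-- A lone space is all that can be left over from an all-space input.
	if the_input is space then return ""
	return the_input
end trim_spaces

trim_spaces(" NEW YORK ") -- returns "NEW YORK"
```

Unlike the text item version, this leaves the input alone when there is nothing to trim.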

There is a LOT of logging peppered in the main part of the script. Hopefully it (and how to turn it off) is easy to work with. It’s really the only way to nail down what you’re getting and where you’re getting it.

Last but certainly not least, and the star of our show, there is a (US) state name abbreviation subroutine. It could easily be run in reverse or even set up with a switch. BTW, you’re welcome. 😀
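One way the switch idea might look is sketched below. The handler name `state_lookup` and its `to_short` parameter are made up for illustration, and the lists are cut down to three entries to keep it short; the full lists from the script would drop straight in.

```applescript
-- Hypothetical two-way lookup: to_short = true abbreviates, false expands.
on state_lookup(_state, to_short)
	set long_state to {"CALIFORNIA", "NEW YORK", "TEXAS"} -- shortened for illustration
	set short_state to {"CA", "NY", "TX"}
	-- Pick which list we search and which list we answer from.
	if to_short then
		set {from_list, to_list} to {long_state, short_state}
	else
		set {from_list, to_list} to {short_state, long_state}
	end if
	repeat with n from 1 to count of from_list
		if item n of from_list is _state then return item n of to_list
	end repeat
	-- No match, so send the original back.
	return _state
end state_lookup

state_lookup("NY", false) -- returns "NEW YORK"
```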

P.S. You may notice there is a second text abbreviation function. The first one, for state names, is pure AppleScript: it treats the input as a single object, finds its position in one list, and pulls the item at that position from the other list. The second one, for street address abbreviations, uses sed and checks the whole line against every pattern in one list (with a second list holding the matching replacements). You could do the state names with the sed routine, but it seems to me it would be slower, since most of the patterns you test are guaranteed not to be in the line you’re feeding it.

Then again, I could do two sed-based routines, one for each line of the address, each with its own focused list. I wouldn’t have to extract the state name doing it that way. Just throw the whole line at it… Hmmm, to be continued?
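A rough sketch of what such a routine could look like follows. It is untested against real mail data, the handler name `abbreviate_line` is hypothetical, and it builds one `sed` call with an `-e` expression per pattern rather than looping over `do shell script`. The `[[:<:]]` and `[[:>:]]` word boundaries assume the `sed` that ships with OS X.

```applescript
-- Hypothetical handler: abbreviates every long_list match in a whole line with a single sed call.
on abbreviate_line(raw_line, long_list, short_list)
	set sed_args to ""
	repeat with s from 1 to count of long_list
		-- One -e expression per pattern; quoted form keeps the shell from touching it.
		set sed_args to sed_args & " -e " & quoted form of ("s/[[:<:]]" & item s of long_list & "[[:>:]]/" & item s of short_list & "/g")
	end repeat
	return do shell script "echo " & quoted form of raw_line & " | sed" & sed_args
end abbreviate_line

-- Throw the whole second address line at it, with a focused (here shortened) state list.
abbreviate_line("SPRINGFIELD, NEW YORK 12345", {"NEW YORK", "TEXAS"}, {"NY", "TX"})
```

With a focused list per address line, the state name never has to be pulled out first, at the cost of testing every pattern against the line.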

-- http://strawhousepig.net/


-- Be aware that upper_cased() also removes every "." from text that is sent to it.
set _search to "Here is the knowledge you seek"

-- String that selected e-mails must contain. Can also be set to "is" instead of "contains" below.

set debug to true

-- Once you get things working set to 'false'.
set theAddresses to ""

set theCount to 0

set errors to ""
tell application "Mail"

	set _messages to selection as list

	repeat with m in _messages

		try

			if subject of m contains _search then

				-- Can also look at other message properties for the _search string.

				set _content to content of m as rich text

				set my text item delimiters to {"Name", "Email", "Address", "United States"}

				if debug is true then

					-- Find which text items contain the data we are after.

					-- This time we want to label everything because it may be a lot of items, including whitespace items.

					-- Counting text items after we refine the data here should be a lot less tricky. You can of course do it like this again, too.

					repeat with t from 1 to count of text items of _content

						try

							log "text item " & t & " of _content = " & item t of every text item of _content

						end try

					end repeat

				end if

				set _name to text item 2 of _content

				set _address to text item 4 of _content

				set my text item delimiters to {character id 10, character id 13, character id 8232}

				-- 10 = line feed, 13 = carriage return, 8232 = line separator

				if debug is true then

					log "_name :" & text items of _name

					log "_address :" & text items of _address

				end if

				set _name to my upper_cased(text item 3 of _name)

				set _address1 to my upper_cased(text item 3 of _address)

				set _address2 to my upper_cased(text item 4 of _address)

				if debug is true then

					log "_name: " & text items of _name

					log "_address1: " & text items of _address1

					log "_address2: " & text items of _address2

				end if

				set my text item delimiters to {",", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}

				set _city to text item 1 of _address2

				set _state to text item 2 of _address2 -- This leaves us with a space before and after the text, which we remove later.

				set my text item delimiters to {space}

				set _zip to last text item of _address2

				if debug is true then

					log "_city: " & _city

					log "_state: " & _state

					log "_zip: " & _zip

				end if

				set _state to my state_abr(my trim_char(_state, "both"))

				if debug is true then

					log "_state: " & _state

				end if

				set theAddresses to theAddresses & _name & return & my address_abr(_address1) & return & _city & ", " & _state & " " & _zip & return & return

				set theCount to theCount + 1

			else

				set errors to errors & "Selected e-mail subject does not contain what we are looking for." & return & return

			end if

		on error theErr

			set errors to errors & theErr & return & return

		end try

	end repeat

	if theCount is not 0 then

		set the clipboard to theAddresses

		display dialog (theCount as string) & " addresses ready to paste." with icon 1

	else

		display dialog "No addresses were extracted from selected e-mails. :(" with icon 2

	end if

	if errors is not "" then display dialog "Some funny business was reported:" & return & return & errors

end tell
on trim_char(the_input, first_last_both)

	set my text item delimiters to {}

	set trim_count to count of text items of the_input

	if first_last_both is "first" then return text items 2 thru trim_count of the_input as string

	if first_last_both is "last" then return text items 1 thru (trim_count - 1) of the_input as string

	if first_last_both is "both" then return text items 2 thru (trim_count - 1) of the_input as string

end trim_char
on upper_cased(_input)

	return (do shell script "echo " & quoted form of _input & " | sed 's/\\.//g' | tr a-z A-Z")

end upper_cased
on state_abr(_state)

	-- States don't have to be listed in all caps as this is case insensitive, but that's how it's sent to this function.

	set long_state to {"ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA", "COLORADO", "CONNECTICUT", "DELAWARE", "FLORIDA", "GEORGIA", "HAWAII", "IDAHO", "ILLINOIS", "INDIANA", "IOWA", "KANSAS", "KENTUCKY", "LOUISIANA", "MAINE", "MARYLAND", "MASSACHUSETTS", "MICHIGAN", "MINNESOTA", "MISSISSIPPI", "MISSOURI", "MONTANA", "NEBRASKA", "NEVADA", "NEW HAMPSHIRE", "NEW JERSEY", "NEW MEXICO", "NEW YORK", "NORTH CAROLINA", "NORTH DAKOTA", "OHIO", "OKLAHOMA", "OREGON", "PENNSYLVANIA", "RHODE ISLAND", "SOUTH CAROLINA", "SOUTH DAKOTA", "TENNESSEE", "TEXAS", "UTAH", "VERMONT", "VIRGINIA", "WASHINGTON", "WEST VIRGINIA", "WISCONSIN", "WYOMING"}

	set short_state to {"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"}

	repeat with n from 1 to count of long_state

		if item n of long_state is _state then return item n of short_state

	end repeat

	-- If someone doesn't know how to spell we need to send the original back.

	return _state

end state_abr

on address_abr(raw_address)
	-- WE ARE... We are changing case to upper before sending to this function to avoid style differences. Why?
	-- Because `sed` on OS X cannot be used case insensitive (same for NetBSD apparently, and is POSIX compliant?).
	set long_address to {"NORTH", "SOUTH", "EAST", "WEST", "NORTHEAST", "NORTHWEST", "SOUTHEAST", "SOUTHWEST", "ROAD", "STREET", "AVENUE", "BOULEVARD", "PLACE", "CIRCLE", "DRIVE", "LANE", "ROUTE", "SUITE", "APARTMENT", "PARKWAY", "HIGHWAY", "EXPRESSWAY", "BYPASS", "CAUSEWAY", "STRAVENUE", "THROUGHWAY", "TURNPIKE", "VIADUCT"}
	set short_address to {"N", "S", "E", "W", "NE", "NW", "SE", "SW", "RD", "ST", "AVE", "BLVD", "PL", "CIR", "DR", "LN", "RTE", "STE", "APT", "PKWY", "HWY", "EXPY", "BYP", "CSWY", "STRA", "TRWY", "TPKE", "VIA"}
	repeat with s from 1 to count of long_address
		-- quoted form keeps apostrophes, quotes and other shell characters in the address from breaking the command.
		set raw_address to (do shell script "echo " & quoted form of raw_address & " | sed 's/[[:<:]]" & item s of long_address & "[[:>:]]/" & item s of short_address & "/g'")
		-- Taking apart the `sed` command: s=replace; /=Regex delimiter; [[:<:]] & [[:>:]]=PATTERN delineation so we don't have to worry
		-- about "NORTHERN" or "BROADWAY", aka word boundary; g=all occurrences of PATTERN. Try it out!
		-- echo BROADWAY ROAD | sed 's/[[:<:]]ROAD[[:>:]]/RD/g'
	end repeat
	return raw_address
end address_abr

The below version works on 10.4 (AppleScript version 1.10), and has not been tested beyond that.
-- http://strawhousepig.net/


set _search to "Here is the knowledge you seek"

-- String that selected e-mails must contain. Can also be set to "is" instead of "contains" below.

set debug to true

-- If things don't work out as planned, set to true and look at the Replies pane below.
set theAddresses to ""

set theCount to 0

set errors to ""
tell application "Mail"

	set _messages to selection as list

	repeat with m in _messages

		try

			if subject of m contains _search then

				-- Can also look at other message properties for the _search string.

				set _content to content of m

				if debug is true then

					-- Find which lines (paragraphs) contain the data we are after.

					repeat with p from 1 to count of paragraphs of _content

						try

							log "paragraph " & p & " = " & item p of every paragraph of _content

						end try

					end repeat

				end if

				set _name to my upper_cased(paragraph 7 of _content)

				-- paragraph 13 contains the next set of data I need, however it also contains a Unicode "line separator" character.

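				set _content2 to (do shell script "echo " & quoted form of (paragraph 13 of _content) & " | grep \\n")

				-- Note that the shell strips the backslash from the unquoted \n, so grep sees the pattern "n" and only passes lines that contain that letter.

				-- This susses out my oddball "line separator" character without using AppleScript text item delimiters, because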

				-- AppleScript versions before 2.0 (OS X 10.5) were not, or not readily, able to work with Unicode characters.

				-- We (as in I) are forced to do this since we (as in I) still run OS X 10.4 at times.

				if debug is true then

					repeat with p2 from 1 to count of paragraphs of _content2

						try

							log "paragraph2 " & p2 & " = " & paragraph p2 of _content2

						end try

					end repeat

				end if

				set _address1 to my upper_cased(paragraph 1 of _content2)

				set _address2 to my upper_cased(paragraph 2 of _content2)

				if debug is true then

					log "_name: " & _name

					log "_address1: " & _address1

					log "_address2: " & _address2

				end if

				set my text item delimiters to {","}

				-- Not sure if it's just me, but on 10.4 I can't use more than the first delimiter in a list. FML...

				-- Isolate city from state and numbers at the end. Hopefully there is always a comma after the city.

				set _city to text item 1 of _address2

				-- Everything in the second text item except its last word (the ZIP code), kept as one string so two-word state names survive.
				set _state to my state_abr(text from word 1 to word -2 of text item 2 of _address2)

				set _zip to last word of text item 2 of _address2

				if debug is true then

					log "_city: " & _city

					log "_state: " & _state

					log "_zip: " & _zip

				end if

				set theAddresses to theAddresses & _name & return & my address_abr(_address1) & return & _city & ", " & _state & " " & _zip & return & return

				set theCount to theCount + 1

			else

				set errors to errors & "Selected e-mail subject does not contain \"" & _search & "\"" & return & return

			end if

		on error theErr

			set errors to errors & theErr & return & return

		end try

	end repeat

	if theCount is not 0 then

		set the clipboard to theAddresses

		display dialog (theCount as string) & " addresses ready to paste." with icon 1

	else

		display dialog "No addresses were extracted from selected e-mails. :(" with icon 2

	end if

	if errors is not "" then display dialog "Some funny business was reported:" & return & return & errors

end tell
on upper_cased(_input)

	return (do shell script "echo " & quoted form of _input & " | tr a-z A-Z")

end upper_cased
on state_abr(_state)

	set long_state to {"ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA", "COLORADO", "CONNECTICUT", "DELAWARE", "FLORIDA", "GEORGIA", "HAWAII", "IDAHO", "ILLINOIS", "INDIANA", "IOWA", "KANSAS", "KENTUCKY", "LOUISIANA", "MAINE", "MARYLAND", "MASSACHUSETTS", "MICHIGAN", "MINNESOTA", "MISSISSIPPI", "MISSOURI", "MONTANA", "NEBRASKA", "NEVADA", "NEW HAMPSHIRE", "NEW JERSEY", "NEW MEXICO", "NEW YORK", "NORTH CAROLINA", "NORTH DAKOTA", "OHIO", "OKLAHOMA", "OREGON", "PENNSYLVANIA", "RHODE ISLAND", "SOUTH CAROLINA", "SOUTH DAKOTA", "TENNESSEE", "TEXAS", "UTAH", "VERMONT", "VIRGINIA", "WASHINGTON", "WEST VIRGINIA", "WISCONSIN", "WYOMING"}

	set short_state to {"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"}

	repeat with n from 1 to count of long_state

		if item n of long_state is _state then return item n of short_state

	end repeat

	-- If someone doesn't know how to spell we need to send the original back.

	return _state

end state_abr

on address_abr(raw_address)
	-- This is some hot garbage right here.
	-- WE ARE... We are changing case to upper before sending to this function to avoid style differences. Why?
	-- Because `sed` on OS X cannot be used case insensitive (same for NetBSD apparently, and is POSIX compliant?).
	set long_address to {"NORTH", "SOUTH", "EAST", "WEST", "NORTHEAST", "NORTHWEST", "SOUTHEAST", "SOUTHWEST", "ROAD", "STREET", "AVENUE", "BOULEVARD", "PLACE", "CIRCLE", "DRIVE", "LANE", "ROUTE", "SUITE", "APARTMENT", "PARKWAY", "HIGHWAY", "EXPRESSWAY", "BYPASS", "CAUSEWAY", "STRAVENUE", "THROUGHWAY", "TURNPIKE", "VIADUCT"}
	set short_address to {"N", "S", "E", "W", "NE", "NW", "SE", "SW", "RD", "ST", "AVE", "BLVD", "PL", "CIR", "DR", "LN", "RTE", "STE", "APT", "PKWY", "HWY", "EXPY", "BYP", "CSWY", "STRA", "TRWY", "TPKE", "VIA"}
	repeat with s from 1 to count of long_address
		-- quoted form keeps apostrophes, quotes and other shell characters in the address from breaking the command.
		set raw_address to (do shell script "echo " & quoted form of raw_address & " | sed 's/\\b" & item s of long_address & "\\b/" & item s of short_address & "/g'")
		-- Taking apart the `sed` command: s=replace; /=Regex delimiter; \b=PATTERN delineation, written [[:<:]] & [[:>:]] in the version above
		-- (so we don't have to worry about "NORTHERN" or "BROADWAY", it only matches whole words); g=all occurrences of PATTERN
		-- sed 's/\bPATTERN\b/REPLACEMENT/g'
	end repeat
	return raw_address
end address_abr
