Parsing e-mail for info

[UPDATE] Added a version that works on OS X 10.4 to the bottom of this post. Does some different logging and is tighter, so I’ll have to go back to the original and update again.

This was done earlier, but I have since had to change it to match changes in the incoming data, some of which are a big mystery. In particular, there is a new-line character that arrives as the Unicode “line separator” character (U+2028, character id 8232). It would end up in the clipboard as a line feed, so I just started plugging in whitespace character ids until one hit the mark. Lucky me.
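Instead of guessing at ids, one option is to ask the text for its own code points. This is only a minimal sketch: it assumes AppleScript 2.0 (OS X 10.5) or later, and `sample` is made-up data standing in for the content of a real message.

```applescript
-- `sample` is hypothetical; a real run would use the content of a Mail message.
set sample to "123 Main Street" & (character id 8232) & "Springfield, Illinois 62701"

-- `id of` some text returns its Unicode code points as a list of integers,
-- so an invisible character shows up as a number (8232 for the line separator).
set codePoints to id of sample

-- Once the id is known, it can be used as a text item delimiter.
set oldDelims to AppleScript's text item delimiters
set AppleScript's text item delimiters to (character id 8232)
set theLines to text items of sample
set AppleScript's text item delimiters to oldDelims

return theLines --> {"123 Main Street", "Springfield, Illinois 62701"}
```

Logging `codePoints` in Script Editor should make any oddball whitespace stand out among the ordinary 10s, 13s and 32s.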

With that in mind, this is tuned for my specific data, but should be readily taken apart and tweaked to work for others. It’s mostly simple text item delimiter filtering, which is only done a few times.

Things to note:

There is a subroutine, trim_char(), that removes the first and/or last “character”. It really works on text items, and it has no mechanism to check what is being left out of the returned string.
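If blind trimming ever bites, a checked variant might look something like this. It is a sketch and not part of the script: `trim_whitespace` is a hypothetical name, and the `linefeed` constant assumes AppleScript 2.0 or later.

```applescript
-- Hypothetical alternative to trim_char(): only strips characters that are
-- actually whitespace, so real data is never dropped from either end.
on trim_whitespace(the_input)
	set ws to {space, tab, return, linefeed}
	-- Strip from the front while the first character is whitespace.
	repeat while ((length of the_input) > 0) and (ws contains (character 1 of the_input))
		if (length of the_input) is 1 then return ""
		set the_input to text 2 thru -1 of the_input
	end repeat
	-- Strip from the back while the last character is whitespace.
	repeat while ((length of the_input) > 0) and (ws contains (character -1 of the_input))
		if (length of the_input) is 1 then return ""
		set the_input to text 1 thru -2 of the_input
	end repeat
	return the_input
end trim_whitespace

trim_whitespace("  ILLINOIS ") --> "ILLINOIS"
```

Unlike the text item version, this leaves the input untouched when there is nothing to trim.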

There is a LOT of logging peppered in the main part of the script. Hopefully it (and how to turn it off) is easy to work with. It’s really the only way to nail down what you’re getting and where you’re getting it.

Last but certainly not least, and the star of our show, there is a (US) state names abbreviation subroutine. It could easily be run in reverse or even set up with a switch. BTW, you’re welcome. 😀
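A rough idea of what the switch could look like. This is a sketch under assumptions: `state_convert` and its `to_short` parameter are names made up for illustration, and the lists are cut down to three states to keep it short.

```applescript
-- Hypothetical two-way version of state_abr().
-- to_short = true  : full name -> abbreviation
-- to_short = false : abbreviation -> full name
on state_convert(_state, to_short)
	-- Shortened lists for illustration; the real ones hold all 50 states.
	set long_state to {"ALABAMA", "ALASKA", "ARIZONA"}
	set short_state to {"AL", "AK", "AZ"}
	if to_short then
		set {from_list, to_list} to {long_state, short_state}
	else
		set {from_list, to_list} to {short_state, long_state}
	end if
	repeat with n from 1 to count of from_list
		if item n of from_list is _state then return item n of to_list
	end repeat
	-- No match, so send the original back.
	return _state
end state_convert

{state_convert("Alaska", true), state_convert("AL", false)} --> {"AK", "ALABAMA"}
```

The comparison is case insensitive by default, which is why "Alaska" matches "ALASKA" here.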

P.S. You may notice there is a second text abbreviation function. The first one, for state names, is pure AppleScript and uses the input as a single object to find its position in one list, then pulls the item at the same position from the other list. The second one, for street address abbreviations, uses sed and examines the whole line against the entire list of patterns (with a second list holding the matching replacements). You could do the state names with the sed routine, but it seems to me it would be slower, since most of the patterns you test are guaranteed not to be part of the line you’re feeding it.

Then again, I could do two sed-based routines, one for each line of the address, each with its own focused list. I wouldn’t have to extract the state name doing it that way. Just throw the whole line at it… Hmmm, to be continued?
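For what it’s worth, the whole-line idea could be sketched like this. It is not what the script below does: `abbreviate_line` is a hypothetical handler that builds a single sed call with one -e expression per word pair, instead of one shell call per pair. It assumes the BSD sed that ships with OS X, where [[:<:]] and [[:>:]] mark word boundaries.

```applescript
-- Hypothetical handler: one `sed` call carrying one -e expression per word pair.
on abbreviate_line(raw_line, long_list, short_list)
	set sed_args to ""
	repeat with i from 1 to count of long_list
		set one_rule to "s/[[:<:]]" & (item i of long_list) & "[[:>:]]/" & (item i of short_list) & "/g"
		-- quoted form protects each rule from the shell.
		set sed_args to sed_args & " -e " & quoted form of one_rule
	end repeat
	return (do shell script "echo " & quoted form of raw_line & " | sed" & sed_args)
end abbreviate_line

-- Shortened lists for illustration; the full lists live in state_abr() and address_abr().
abbreviate_line("SPRINGFIELD, NEW YORK 12345", {"NEW YORK", "NEW JERSEY"}, {"NY", "NJ"})
```

Calling it once with the state lists on the second address line and once with the street lists on the first would cover both cases, at the cost of testing patterns that cannot match.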

-- http://strawhousepig.net/

-- Be aware that upper_cased() also removes every "." from text that is sent to it.

set _search to "New submission from "
-- String that selected e-mails must contain. Can also be set to "is" instead of "contains" below.
set debug to true
-- Once you get things working set to 'false'.

set theAddresses to ""
set theCount to 0
set errors to ""

tell application "Mail"
	set _messages to selection as list
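	repeat with m in _messages
		set _name to "(unknown sender)" -- Reset each pass so the error handler below always has a name to report.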
		try
			if subject of m contains _search then
				-- Can also look at other message properties for the _search string.
				set _content to (content of m) as rich text
				set my text item delimiters to {"Name", "Email", "Address", "United States"}
				if debug is true then
					-- Find which text items contain the data we are after. 
					-- This time we want to label everything because it may be a lot of items, including whitespace items.
					-- Counting text items after we refine the data here should be a lot less tricky. You can of course do it like this again, too.
					repeat with t from 1 to count of text items of _content
						try
							log "text item " & t & " of _content = " & item t of every text item of _content
						end try
					end repeat
				end if
				set _name to text item 2 of _content
				set _address to text item 4 of _content
				set my text item delimiters to {character id 10, character id 13, character id 8232}
				-- 10 = line feed, 13 = carriage return, 8232 = line separator
				if debug is true then
					log "_name :" & text items of _name
					log "_address :" & text items of _address
				end if
				set _name to my upper_cased(text item 3 of _name)
				set _address1 to my upper_cased(text item 3 of _address)
				set _address2 to my upper_cased(text item 4 of _address)
				if debug is true then
					log "_name: " & text items of _name
					log "_address1: " & text items of _address1
					log "_address2: " & text items of _address2
				end if
				set my text item delimiters to {",", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}
				set _city to text item 1 of _address2
				set _state to text item 2 of _address2 -- This leaves us with a space before and after the text, which we remove later.
				set my text item delimiters to {space}
				set _zip to last text item of _address2
				if debug is true then
					log "_city: " & _city
					log "_state: " & _state
					log "_zip: " & _zip
				end if
				set _state to my state_abr(my trim_char(_state, "both"))
				if debug is true then
					log "_state: " & _state
				end if
				set theAddresses to theAddresses & _name & return & my address_abr(_address1) & return & _city & ", " & _state & " " & _zip & return & return
				set theCount to theCount + 1
			else
				set errors to errors & "Selected e-mail subject does not contain what we are looking for." & return & return
			end if
		on error theErr
			set errors to errors & _name & ": " & theErr & return & return
		end try
	end repeat
	if theCount is not 0 then
		--set the clipboard to theAddresses -- Swap the comments on this and "my write2file()" below to send to the clipboard instead.
		--display dialog (theCount as string) & " addresses ready to paste." with icon 1
		my write2file(theAddresses)
		-- Could be turned into a CSV file fairly easily.
	else
		display dialog "No addresses were extracted from selected e-mails. :(" with icon 2
	end if
	if errors is not "" then display dialog "Some funny business was reported:" & return & return & errors & return & return & "Run script from Script Editor to view more info."
end tell

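on write2file(_input)
	set listFile to ((path to public folder) as text) & "Parsed addresses " & (do shell script "date '+%Y-%m-%d %H%M'") & ".txt"
	-- The file read/write commands belong to Standard Additions, so no Finder tell block is needed.
	set newFile to (open for access file listFile with write permission)
	try
		set eof newFile to 0
		write _input to newFile
		close access newFile
	on error errMsg number errNum
		try
			close access newFile -- Don't leave the file open if the write fails.
		end try
		error errMsg number errNum
	end try
	tell application "TextEdit" to open (listFile as alias)
end write2file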

on trim_char(the_input, first_last_both)
	set my text item delimiters to {}
	set trim_count to count of text items of the_input
	if first_last_both is "first" then return text items 2 thru trim_count of the_input as string
	if first_last_both is "last" then return text items 1 thru (trim_count - 1) of the_input as string
	if first_last_both is "both" then return text items 2 thru (trim_count - 1) of the_input as string
end trim_char

on upper_cased(_input)
	return (do shell script "echo " & quoted form of _input & " | sed 's/\\.//g' | tr a-z A-Z")
end upper_cased

on state_abr(_state)
	-- States don't have to be listed in all caps as this is case insensitive, but that's how it's sent to this function.
	set long_state to {"ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA", "COLORADO", "CONNECTICUT", "DELAWARE", "FLORIDA", "GEORGIA", "HAWAII", "IDAHO", "ILLINOIS", "INDIANA", "IOWA", "KANSAS", "KENTUCKY", "LOUISIANA", "MAINE", "MARYLAND", "MASSACHUSETTS", "MICHIGAN", "MINNESOTA", "MISSISSIPPI", "MISSOURI", "MONTANA", "NEBRASKA", "NEVADA", "NEW HAMPSHIRE", "NEW JERSEY", "NEW MEXICO", "NEW YORK", "NORTH CAROLINA", "NORTH DAKOTA", "OHIO", "OKLAHOMA", "OREGON", "PENNSYLVANIA", "RHODE ISLAND", "SOUTH CAROLINA", "SOUTH DAKOTA", "TENNESSEE", "TEXAS", "UTAH", "VERMONT", "VIRGINIA", "WASHINGTON", "WEST VIRGINIA", "WISCONSIN", "WYOMING"}
	set short_state to {"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"}
	repeat with n from 1 to count of long_state
		if item n of long_state is _state then return item n of short_state
	end repeat
	-- If someone doesn't know how to spell we need to send the original back.
	return _state
end state_abr

on address_abr(raw_address)
	-- WE ARE... We are changing case to upper before sending to this function to avoid style differences. Why?
	-- Because `sed` on OS X cannot be used case insensitive (same for NetBSD apparently, and is POSIX compliant?).
	set long_address to {"NORTH", "SOUTH", "EAST", "WEST", "NORTHEAST", "NORTHWEST", "SOUTHEAST", "SOUTHWEST", "ROAD", "STREET", "AVENUE", "BOULEVARD", "PLACE", "CIRCLE", "DRIVE", "LANE", "ROUTE", "SUITE", "APARTMENT", "PARKWAY", "HIGHWAY", "EXPRESSWAY", "BYPASS", "CAUSEWAY", "STRAVENUE", "THROUGHWAY", "TURNPIKE", "VIADUCT"}
	set short_address to {"N", "S", "E", "W", "NE", "NW", "SE", "SW", "RD", "ST", "AVE", "BLVD", "PL", "CIR", "DR", "LN", "RTE", "STE", "APT", "PKWY", "HWY", "EXPY", "BYP", "CSWY", "STRA", "TRWY", "TPKE", "VIA"}
	repeat with s from 1 to count of long_address
		set raw_address to (do shell script "echo " & quoted form of raw_address & " | sed 's/[[:<:]]" & item s of long_address & "[[:>:]]/" & item s of short_address & "/g'")
		-- Taking apart the `sed` command: s=replace; /=regex delimiter; [[:<:]] & [[:>:]]=word boundaries around PATTERN, so we don't have to worry
		-- about "NORTHERN" or "BROADWAY"; g=all occurrences of PATTERN. Try it out!
		-- echo BROADWAY ROAD | sed 's/[[:<:]]ROAD[[:>:]]/RD/g'
	end repeat
	return raw_address
end address_abr

The below version works on 10.4 (AppleScript version 1.10), and has not been tested beyond that.

-- http://strawhousepig.net/

set _search to "New submission from "
-- String that selected e-mails must contain. Can also be set to "is" instead of "contains" below.
set debug to true
-- If things don't work out as planned, set to true and look at the Replies pane below.

set theAddresses to ""
set theCount to 0
set errors to ""

tell application "Mail"
	set _messages to selection as list
	repeat with m in _messages
		try
			if subject of m contains _search then
				-- Can also look at other message properties for the _search string.
				set _content to content of m
				if debug is true then
					-- Find which lines (paragraphs) contain the data we are after. 
					repeat with p from 1 to count of paragraphs of _content
						try
							log "paragraph " & p & " = " & item p of every paragraph of _content
						end try
					end repeat
				end if
				set _name to my upper_cased(paragraph 7 of _content)
				-- paragraph 13 contains the next set of data I need, however it also contains a Unicode "line separator" character.
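				set _content2 to (do shell script "echo " & quoted form of (paragraph 13 of _content) & " | grep \\n")
				-- quoted form keeps apostrophes and other shell characters in the address from breaking the command.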
				-- This susses out my oddball "line separator" character without using AppleScript text item delimiters because,
				-- AppleScript versions before 2.0 (OS X 10.5) were not, or not readily, able to work with Unicode characters.
				-- We (as in I) are forced to do this since we (as in I) still run OS X 10.4 at times.
				if debug is true then
					repeat with p2 from 1 to count of paragraphs of _content2
						try
							log "paragraph2 " & p2 & " = " & paragraph p2 of _content2
						end try
					end repeat
				end if
				set _address1 to my upper_cased(paragraph 1 of _content2)
				set _address2 to my upper_cased(paragraph 2 of _content2)
				if debug is true then
					log "_name: " & _name
					log "_address1: " & _address1
					log "_address2: " & _address2
				end if
				set my text item delimiters to {","}
				-- Not sure if it's just me, but on 10.4 I can't use more than the first delimiter in a list. FML...
				-- Isolate city from state and numbers at the end. Hopefully there is always a comma after the city.
				set _city to text item 1 of _address2
				set _state to my state_abr(text from word 1 to word -2 of (text item 2 of _address2))
				-- Taken as one string rather than a list of words, so a multi-word name such as NEW YORK can match the list in state_abr().
				set _zip to last word of text item 2 of _address2
				if debug is true then
					log "_city: " & _city
					log "_state: " & _state
					log "_zip: " & _zip
				end if
				set theAddresses to theAddresses & _name & return & my address_abr(_address1) & return & _city & ", " & _state & " " & _zip & return & return
				set theCount to theCount + 1
			else
				set errors to errors & "Selected e-mail subject does not contain \"" & _search & "\"" & return & return
			end if
		on error theErr
			set errors to errors & theErr & return & return
		end try
	end repeat
	if theCount is not 0 then
		set the clipboard to theAddresses
		display dialog (theCount as string) & " addresses ready to paste." with icon 1
	else
		display dialog "No addresses were extracted from selected e-mails. :(" with icon 2
	end if
	if errors is not "" then display dialog "Some funny business was reported:" & return & return & errors
end tell

on upper_cased(_input)
	return (do shell script "echo " & quoted form of _input & " | tr a-z A-Z")
end upper_cased

on state_abr(_state)
	set long_state to {"ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA", "COLORADO", "CONNECTICUT", "DELAWARE", "FLORIDA", "GEORGIA", "HAWAII", "IDAHO", "ILLINOIS", "INDIANA", "IOWA", "KANSAS", "KENTUCKY", "LOUISIANA", "MAINE", "MARYLAND", "MASSACHUSETTS", "MICHIGAN", "MINNESOTA", "MISSISSIPPI", "MISSOURI", "MONTANA", "NEBRASKA", "NEVADA", "NEW HAMPSHIRE", "NEW JERSEY", "NEW MEXICO", "NEW YORK", "NORTH CAROLINA", "NORTH DAKOTA", "OHIO", "OKLAHOMA", "OREGON", "PENNSYLVANIA", "RHODE ISLAND", "SOUTH CAROLINA", "SOUTH DAKOTA", "TENNESSEE", "TEXAS", "UTAH", "VERMONT", "VIRGINIA", "WASHINGTON", "WEST VIRGINIA", "WISCONSIN", "WYOMING"}
	set short_state to {"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"}
	repeat with n from 1 to count of long_state
		if item n of long_state is _state then return item n of short_state
	end repeat
	-- If someone doesn't know how to spell we need to send the original back.
	return _state
end state_abr

on address_abr(raw_address) -- This is some hot garbage right here.
	-- WE ARE... We are changing case to upper before sending to this function to avoid style differences. Why?
	-- Because `sed` on OS X cannot be used case insensitive (same for NetBSD apparently, and is POSIX compliant?).
	set long_address to {"NORTH", "SOUTH", "EAST", "WEST", "NORTHEAST", "NORTHWEST", "SOUTHEAST", "SOUTHWEST", "ROAD", "STREET", "AVENUE", "BOULEVARD", "PLACE", "CIRCLE", "DRIVE", "LANE", "ROUTE", "SUITE", "APARTMENT", "PARKWAY", "HIGHWAY", "EXPRESSWAY", "BYPASS", "CAUSEWAY", "STRAVENUE", "THROUGHWAY", "TURNPIKE", "VIADUCT"}
	set short_address to {"N", "S", "E", "W", "NE", "NW", "SE", "SW", "RD", "ST", "AVE", "BLVD", "PL", "CIR", "DR", "LN", "RTE", "STE", "APT", "PKWY", "HWY", "EXPY", "BYP", "CSWY", "STRA", "TRWY", "TPKE", "VIA"}
	repeat with s from 1 to count of long_address
		set raw_address to (do shell script "echo " & quoted form of raw_address & " | sed 's/\\b" & item s of long_address & "\\b/" & item s of short_address & "/g'")
		-- Taking apart the `sed` command: s=replace; /=regex delimiter; \b on each side of PATTERN=word boundary (so we don't have to worry
		-- about "NORTHERN" or "BROADWAY", it only matches whole words); g=all occurrences of PATTERN.
		-- The first version of this script marks the boundaries with [[:<:]] and [[:>:]] instead: sed 's/[[:<:]]PATTERN[[:>:]]/REPLACEMENT/g'
	end repeat
	return raw_address
end address_abr
