You could start with a robots.txt file: Meta Robots Tag & Robots.txt Tutorial for Google, Bing & Other Search Engines. I believe the user agent for the archive.org bot is ia_archiver.
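As a rough sketch (assuming ia_archiver is indeed the right user agent), a minimal robots.txt asking that bot to stay out of the whole site would look something like:

# robots.txt - ask the archive.org crawler to skip everything
User-agent: ia_archiver
Disallow: /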
That will at least let you control the bots that obey the robots.txt standard. Obviously the naughty bots, like email harvesters, will ignore it and go ahead anyway; then you're in the world of trying to block by IP address and/or user-agent in .htaccess, and it's a game of cat and mouse.
Good web hosts are pretty efficient at blocking a lot of the junk; if you're on a dedicated server and doing it yourself, good luck.
This resource might be handy: Bad Bots user-agent / bot
# Example: blocking by user-agent in .htaccess
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EmailGrabber|NaughtyBot) [NC]
RewriteRule (.*) - [F,L]
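The [NC] flag makes the user-agent match case-insensitive, and [F,L] sends a 403 Forbidden and stops processing further rules, so the request never reaches your pages or scripts. The bot names above are just placeholders; substitute the actual user-agent strings you see in your logs.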
# Block some by IP address
RewriteCond %{REMOTE_ADDR} ^999\.999\.999\.999 [OR]
RewriteCond %{REMOTE_ADDR} ^911\.911\.911\.911 [OR]
RewriteCond %{REMOTE_ADDR} ^111\.222\.333\.444
RewriteRule (.*) - [F,L]
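Those addresses are obviously placeholders (real octets can't exceed 255); swap in the actual offending IPs from your access logs. The backslashes keep the dots literal in the regex, and the [OR] flags chain the conditions so a match on any one of them triggers the block.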