This manual is for the Methanol Web Crawling System, version 1.7.0.
Understanding and running Methanol
Customization
Copying and license
Appendices
Indices
Methanol is a complete web crawling system aiming for optimal customizability. Methanol tries to be more than just a web crawling system, and gives you the option to modify how and what data should should be indexed, processed and ultimately displayed to the user.
Methanol's primary point is its web crawler Methabot. Methabot has been thoroughly
tested on thousands of different kinds of websites and website layouts. Methabot
is powered by its underlying web crawling library libmetha.
libmetha Feature OverviewThere are a few concepts that you should be familiar with when working with
Methanol systems. The probably most important thing to know is that a crawler,
in terms of the Methanol system, is not a function or process.
A crawler is merely a set of rules as to how web pages and files are crawled.
The actual process with threads crawling web pages or files should be refered to as a client with worker threads. Each worker may work independently on separate pages, but as long as they are running under the same process they share the same URL cache and set of rules.
A worker will always be directly connected to a single crawler at once. However, depending on how the configuration file looks, the worker may dynamically switch to another crawler.
The primary difference between any two crawlers is their lists of filetypes. One crawler might for example have filetype definitions for HTML files, while another crawler might have filetype definitions for video files. The worker is restricted to its current crawler's list of filetypes.
When a worker is running with a specific crawler, it might find URLs or data matching one of the crawler's filetypes. The filetype in turn can have parsers and attributes. A parser is in short a script or callback function that sets attributes. As an example, you could define a filetype named "HTML" with attributes such as "title" and "description". The parsers job is to extract data and set these attributes.
When a client is connected to a Methanol system, data can be uploaded to the system if the attributes talked about above were set by a parser. Once the data has been uploaded, they will be available in the MySQL database.
This chapter will introduce you to the various programs included in the Methanol suite.
mb – Methabot Command Line Toolmb is short for Methabot. Methabot is a handy command line tool for fetching and extracting data from the web or local files.
mbGet a list of available runtime options by invoking:
$ mb --help
Another useful command to know is:
$ mb --info
The above example will output a list of runtime information about the installed version of Methabot. Most interesting is the list of default configuration files that are installed.
To load and use any default configuration file, a colon prefix is used. As an example, to load "archive", run:
$ mb :archive
Merely loading a configuration will not do anything useful, but a URL must be provided. A URL can be provided on command line, any argument not beginning with "-" or ":" will be assumed to be a URL.
$ mb :archive www.gnome.org
The above command will download and parse the front page of gnome.org, and print a list of all archive files found.
This behaviour is generally the default behaviour of methabot, provide it with one configuration and one URL, and it will return a list of target URLs.
mb command line optionsThe easiest way of changing Methabot's behaviour is through command line. This way of configuring methabot is not as flexible as by writing a configuration file, but it is still very powerful and most importantly easy.
The following three sections will describe the three different kinds of command line configuration options available in Methabot.
A crawler option affects how the crawling is performed, think of it as behaviour options. Methabot (and libmetha in general) is capable of having multiple crawlers defined, and dynamically switching between crawlers at run time. The concept of crawler switching is however not covered in this chapter.
When configuring from command line, you will only be able to configure and change one crawler. By default, this will be the "default" crawler.
Filetype options are used to define your own target filetype.
Unlike crawler options, filetype options does not affect an
already defined filetype, unless you use the --filetype option.
These options affect the runtime configuration in general, such as how many workers (see Worker) to launch, proxy server settings or the user agent to use.
mb-client – The Web Crawler ClientThe Methabot System Client (mb-client) is very similar to
mb, but instead of communication with the user, mb-client
communicates with a Methanol system.
mb-client can not be configured directly. It will always
receive its configuration from a master server. mb-client
only requires an IP-address to a master server when it is started.
In order to give mb-client the login credentials to the
master server, it does however require a very minimal configuration file.
This file will usually be locate at /etc/mb-client.conf depending
on where you have installed Methanol. You can also use a custom configuration
file by providing it from command line.
If a default configuration file does not exist, mb-client
will attempt to connect to a master server running on 127.0.0.1:5505,
username default and password default.
mn-masterd – The Master Servermn-masterd, also known as the master server, is the heart of
a Methanol system. The purpose of the master is to distribute all
connected clients to slave servers, and to keep statistics and
configuration settings. The master is only required during system
startup and when connecting new slaves or clients to the system.
Once the system is up and running, no master is required.
When the master has been started it will listen for incoming connections, by default on port 5505, but that can be changed.
mn-masterd loads its configuration settings from mn-masterd.conf.
Normally this file is located at /etc/mn-masterd.conf. It is
also possible to give it another mn-masterd.conf using the
--config command line option.
Start mn-masterd by running the following command:
$ mn-masterd
mn-masterd will fork and run in the background. You can stop it
using madmind or by sending it the INT signal.
mn-slaved – The Slave Servermn-slaved is the layer between the client and the database.
mn-slaved manages clients, sessions, logging and runs system
script hooks.
madmind – Server Administration Toolmadmind is a command line tool for communicating with the Master server.
It is not available in this release, phpMadMind can be used instead.
Methanol is a part of the methabot package. The latest methabot
packages includes all the source code required to build any specific part of
the system you should want.
This section will describe how to compile and install Methanol. During this section of the manual, you will get to choose parts and configure your own installation depending on what programs in the Methanol suite you need.
To compile Methanol successfully, you will need the following package dependencies installed on your system, depending on which parts you plan to install:
| Package | client-side1 | server-side2
|
|---|---|---|
| MySQL (libmysqlclient) >= 5.0 | no | yes
|
| libcurl >= 7.16.0 | yes | no
|
| SpiderMonkey >= 1.7.0 | yes | no
|
| libev | only mb-client | yes
|
| pthread | yes | yes
|
configureExtract methabot-1.7.0.tar.gz and enter the created directory
using the following commands:
$ tar xzf methabot-1.7.0.tar.gz
$ cd methabot-1.7.0
Now its time to configure what parts of the system you would like to install.
For example, if you are installing a server you should most likely want to install mn-slaved and/or
mn-masterd only, and mb-client on several other client computers.
Or if you are only interested in the command line utility mb, you don't
want to compile the server daemons.
By default, the configure script will configure the command line utility only, and the server daemons will not be compiled. The reason why they are not compiled by default is because they are still experimental and will be moved to a separate package in the next release.
$ ./configure
The above command will configure methabot for installing the command line tool
mb only.
Configure for compiling and installing mn-masterd and mn-slaved
only using the following command:
$ ./configure --enable-slave --enable-master --disable-cli
Notice the --disable-cli option, by disabling the compilation of the
command line utility you don't have to have dependencies such as SpiderMonkey
installed if you only want the server part installed.
Configure for compiling and installing the client daemon using the following command:
$ ./configure --enable-client
Finally, to get a list of all available options, run the following command:
$ ./configure --help
Once you have configured the package to fit your needs, its time to compile the code. The two following commands will compile and install the parts of the system that you configured it to install. Note that you might need root privileges to invoke the second command:
$ make
$ make install
phpMadMind is an official extension to the Methanol system. It is a
set of PHP scripts for administrating a Methanol system. You can find
phpMadMind in the latest methabot package, under the src/
directory.
To install phpMadMind, you will first need a web server and PHP installed
You will also need the PHP extension simplexml installed.
Move the whole phpMadMind tree to any directory reachable from the web server,
and open up config.example.php. This file is pretty straight-forward
and you should be able to configure it on your own. Once you are done, save
it as config.php and navigate to the directory using a web browser.
You will be presented with a login screen, if you just installed your system there will be a default user login with username "default" and password "default".
This chapter will help you set up and test an initial installation of the server programs included in Methanol.
Installed on your system you should have two configuration files,
mn-masterd.example.conf and mn-slaved.example.conf.
Usually, they are installed to the /etc directory, but it
might depend on how you invoked configure during the
installation part. These two files are skeleton configuration files
for running the system for the first time. Rename them so
mn-masterd and mn-slaved can find them:
$ cd /etc
$ mv mn-masterd.example.conf mn-masterd.conf
$ mv mn-slaved.example.conf mn-slaved.conf
These two files require information so that both server daemons will be able to connect to the MySQL server. If you have a MySQL database and user ready, then you can fill that information in right now and skip the next section.
Security is always an important factor. Node
authentication is used to
prevent others from connecting a client or a slave server to your master
server. When you
start mn-masterd using the skeleton configuration, it will accept
any client or slave authenticating with the username default and
password default.
As long as you are testing the system locally and trust your local users, you should however be safe.
Refer to mn-masterd.conf: Configuring the Master Server
for information about how to set up node authentication.
To set up a database and user login for Methanol to use, first connect to the MySQL server, either through a web interface such as phpMyAdmin or from command line as shown below:
$ mysql --user root --password
Now there are three MySQL statements that we need to execute. First, you must create the actual database that Methanol uses, and secondly a user which Methanol will log in to MySQL using.
mysql> CREATE DATABASE `methanol`;
mysql> GRANT ALL PRIVILEGES ON `methanol`.* to `username`@`localhost`
IDENTIFIED BY 'password';
mysql> FLUSH PRIVILEGES;
Substitute methanol with a database name of choice, username with a username of choice and password with a password of choice. Also, you might want to modify the localhost to accept connections from other locations than the local host, depending on where you will be running the Methanol system and where the MySQL server is located.
Use the information you provided here to fill in the options in
mn-masterd.conf and mn-slaved.conf.
In many cases it might be feasible to run all system nodes on a single server. This section will give an example of how to set up such an environment.
Please walk through the chapter on installing Methanol, and configure Methanol with both the slave and the master daemon, and optionally the client daemon as well.
This example system environment will also prepare the slave for executing hook scripts.
Both mn-slaved and mn-masterd will change their
user id to the id of the user "nobody". This prevents them from
touching irrelevant parts of the system, if someone would
find and exploit a bug in the system.
A custom user still provides the same security as "nobody", but with support for executing hook scripts.
This example will create one user account that will be shared by both the master and the slave, we'll call it "mn-example".
$ useradd mn-example -m -s /sbin/nologin
The users home directory, /home/mn-example, will be used
to structure the various parts of the system.
Here is how we will layout this example system:
+ /home/mn-example/
+-- config/ : Configuration files should be put here
+-- run/ : Run-time files will be generated here
+-- hooks/ : All hook scripts should be put here
$ cd /home/mn-example
$ mkdir config run hooks
$ chown mn-example:mn-example *
$ chmod 700 *
Set the exec_dir slave option to /home/mn-example/run.
user and group should be set to mn-example, the
user we created (and its corresponding group).
Start the master server by invoking:
$ mn-masterd
The process will fork and run in the background. If an error occurs it will notify you and exit with a status code of 1.
If an error occurs and you think the error message reported in the
terminal doesn't help you find the cause of the error, then
check your system messages. Both mn-masterd and mn-slaved
uses syslog to log error messages and warnings.
Where these messages are stored depends on your syslog installation,
but most likely you get the latest messages by running:
$ tail /var/log/messages
Once mn-masterd is up and running, you can try connecting
mn-slaved to it. In this case, it is important that the master_host
option in mn-slaved.conf matches the listening address of the
master server.
mn-slaved will log its error and warning messages using
syslog as well.
This chapter will help you understand how configuration files are structured, how to create your own configuration files and modify existing.
mb and mb-clientThis section will help you understand how to configure Methabot
and mb-client. Please note that if you are configuring
mb-client, the configuration file should be put on the
same server as the active mn-masterd, and not on the local
host running the instance of mb-client.
Currently there are two kinds of classes you can create objects from in configuration files; crawlers and filetypes. Crawlers specify crawling behaviour, while each filetype specify properties for different filetypes such as audio files or HTML files.
A basic configuration file needs at least one crawler and one
filetype. The crawler should be named default, but the
filetype can be named anything. It is also possible to define
more than one crawler, but that topic will not be covered in
this tutorial.
Some modules (see Modules), such as lmm_mysql (see lmm_mysql), register extra functionality. They do this by registering a so-called scope. A scope is similiar to a filetype or crawler object, but does not require a name.
The most basic filetype requires only a name, and can be defined like below:
filetype["example"]
{
}
This will actually create an object of the class "filetype", with the name "example".
An empty filetype declaration isn't of much use. We must
provide it with information about how to actually match
URLs. This can be done in various ways, but extensions and
mimetypes are the two primary options you should be playing with:
filetype["example"]
{
extensions = {"png", "jpg", "jpeg"};
mimetypes = {"image/jpeg", "image/png"};
}
As you can see, both mimetypes and extensions
take arrays. The above code defines a filetype matching
png and jpeg files.
To try the filetype out, you must first include a default configuration. This is required because your current configuration does not define a default crawler. If you include "default.conf", you don't have to configure a crawler. Below is an example of including another configuration file:
include "default.conf"
filetype["example"]
{
extensions = {"png", "jpg", "jpeg"};
mimetypes = {"image/jpeg", "image/png"};
}
The include directive (see Directives)
literally inserts the contents of another configuration file
at the position of the include directive in this file.
Think of it as @import in CSS or #include in C.
The default.conf configuration will define a default crawler,
along with its own filetypes for crawling HTML and text files.
Hence, if you include default.conf and run with your
configuration, Methabot will be able to crawl websites and
concentrate on finding the filetype you have declared.
Now to try your configuration out, put the above code in a
file named example.conf. Move example.conf to ~/.methabot/
and run the following command:
$ mb :example anyurl.com/path/
Methabot will in the above case look for example.conf
first in ~/.methabot/ and then in the default
installation path. In short, your custom configurations will
override the installed configurations.
Unless the above command failed, you should see a list of all
the jpeg and png files found on that URL. Use the -D option
to specify the crawling depth:
$ mb :example anyurl.com/path/ -D 2
The parser option is used by filetypes to bind a parser.
When a URL matches this filetype, the parser will be
called to extract URLs or meta data. For example, you can
use the built-in parser "css" to extract URLs such as
images from CSS files:
include "default.conf"
filetype["example"]
{
extensions = {"png", "jpg", "jpeg"};
mimetypes = {"image/jpeg", "image/png"};
}
filetype["css"]
{
extensions = {"css"};
parser = "css";
}
Since default.conf defined filetypes for HTML crawling,
you now extended its functionality by also adding support for
crawling CSS files. Try running with the above configuration
and you should see that Methabot will crawl HTML as expected,
but also follow links to CSS files and try to find png and jpeg
files there.
Before we get started you must know that a crawler is merely a configuration, and not related to multi-threading in any way. A thread in libmetha is known as a worker, and workers can dynamically switch between different crawler configurations at any time. A crawler tells a worker exactly how to crawl a website, and what filetypes to look for.
Creating a custom crawler is more complex than creating a filetype. That is why the extend keyword is available. Using extend, you can modify the default crawler instead of creating your own from scratch:
include "default.conf"
extend: crawler["default"]
{
}
Inside the brackets you should set all variables you would like to modify from their default values. A tip is to have a look at the default configuration files and learn from them by example.
To modify the default depth limit of this crawler:
include "default.conf"
extend: crawler["default"]
{
depth_limit = 2;
}
extend: Modify existingThe extend keyword lets you modify an already defined object, and thus "extend" its attributes. This keyword is useful when you for example include a default configuration file just to modify a tiny setting. Here is an example:
include "default.conf"
extend: crawler["default"]
{
dynamic_url = "discard";
}
extend: filetype["html"]
{
parser = "blah";
}
The above will modify the crawler "default" defined in
default.conf, and also change the parser of the
filetype "html".
override: Replace existingThe override keyword works just like extend, except for the detail that it clears all settings in the target object before modifying it, and thus it overrides its definition completely. Example:
include "default.conf"
override: filetype["html"]
{
extensions = {"html"};
parser = "example";
}
copy: Copy existing to newUse this keyword to copy the settings from another filetype/crawler and base your object on those settings. Example:
include "default.conf"
filetype["html_2" copy "html"]
{
/* this defines the filetype "html_2" as a copy of "html" */
extensions = {"example"};
/* html_2 will now be identical to HTML, except for the extensions
* array which was explicitly set */
}
You can also combine the copy keyword with extend or override. Though note that even if you combine copy with extend, the result will be as if you combined copy with override, since copy replaces all empty values as well. Here is an example combining copy and override:
include "image.conf"
include "audio.conf"
override: filetype["image" copy "audio"]
{
/* 'image' is defined in image.conf, but since we copied 'audio'
* to it, the image filetype will now be identical to the audio
* filetype. Of course this does not make sense, it's just to
* demonstrate. */
mimetypes = {"image/jpeg"};
/* the 'image' filetype now requires audio file extensions with
* image file mimetypes! */
}
You can also simply copy filetypes to others, without modifying values, this can be pretty useful:
include "default.conf"
include "audio.conf"
include "image.conf"
override: filetype["audio" copy "image"];
override: filetype["html" copy "image"];
override: filetype["text" copy "image"];
/* you now have a completely messed up configuration,
* thinking HTML files are image files and so on. :) */
This section will describe some more advanced topics, you should have a good understanding of the basics before you attempt these topics.
Crawler switching is a feature in libmetha allowing the active worker (thread) to switch to another configuration without restarting the crawling session or losing its current state. The switch itself is done very fast.
The purpose of crawler switching is to allow the worker to crawl and parse different websites or different sub-pages in completely different ways. As an example, you could define a crawler specialized on crawling web forums, and one specialized in crawling RSS feeds or news sites, and then have a generic crawler for all other websites.
Workers can switch between crawler configurations freely. This is done
using the crawler_switch filetype option. A filetype that is
forcing a switch to another crawler configuration is often referred
to as a gateway filetype; when a worker matches a URL with this filetype
it will switch its current crawler configuration to what's specified
in crawler_switch.
Handlers are called before the parsers. The purpose of a handler is a to download the data referenced by the target URL. The handler can, but should not, modify the downloaded data.
A handler should make a decision whether the data should continue to the parser chain or be discarded, and return true or false.
The relationship between handlers and parsers is demonstrated below:
-------+
Worker | +------+
+->-[URL]-->-Handler-+ |[URL] |
| +-+[data]|-->-Parser Chain-->-[URL List]->-+
| +------+ |
| v
|<-------<-------<-------<-------<-------<-------<-------<------+
-------+
Handlers can be set using the handler filetype option or the default_handler
crawler option.
A handler can, just like a parser, be either a C or a Javascript function.
mn-masterd.conf: Configuring the Master Servermn-masterd is configured through mn-masterd.conf, normally
this file can be found at /etc/mn-masterd.conf. The layout of this file
is simple, all options should be put inside a scope named master, like this:
master {
listen = "127.0.0.1";
}
The above file will configure the listening address for the server.
Since mn-masterd will require both mn-slaved and mb-client
to authenticate once they have connected, you will of course need to
set authentication credentials.
To define a slave named "default", with password set to "pwd", allowed to log in from the local network:
slave["default"]
{
password = "pwd";
allow = {"192.168.1.0/24"};
}
The allow option defines where the slave is allowed to connect from.
This option expects an array of subnets, that is network addresses in combination
with subnet masks. Note that if you omitt allow completely, all incoming
connections will be accepted (assuming they login using the right password).
master Option Referencelistenconfig_filesession_complete_hookcleanup_hookusergroupmysql_hostmysql_usermysql_passmysql_dbuser Option Referenceslave Option Referencemn-slaved.conf: Configuring the Slave Servermn-slaved loads its configuration file from mn-slaved.conf, which normally resides at
/etc/mn-slaved.conf.
listenmaster_host127.0.0.1 is used.
master_portmaster_host. If left
unset, the default port number 5505 is used.
master_userslave[]
definition in mn-masterd.conf. If left unset, default is used.
master_passwordmaster_user.
If left unset, default is used.
usermn-slaved as.
groupmn-slaved as.
exec_diruser has permissions to execute
and write to the directory.
mysql_hostmysql_usermysql_passmysql_dbmb-client.conf: Configuring the Client Daemonmb-client.conf is used to set up login credentials for
the client daemon to use when logging in to the master server.
Configuration such as filetypes and crawlers can not be set in this
file, but they will be received from the master server once connected.
If mb-client.conf can not be found by the daemon, it will
assume the IP address of the master to be 127.0.0.1, the port 5505,
username default and password default.
master_host127.0.0.1 is used.
master_portmaster_host. If left
unset, the default port number 5505 is used.
master_usernameslave[]
definition in mn-masterd.conf. If left unset, default is used.
master_passwordmaster_user.
The concept of parsers is one of the most important in Methanol. A parser is responsible for extracting URLs, meta-data and settings attributes for each retrieved URL.
Currently, a parser can be programmed in either C or Javascript. Multiple parsers can be chained and share the same data buffer.
Parser chaining is a concept introduced in libmetha-1.6.0. Parser chaining allows multiple parsers to work on the same data and share their modifications.
data -> [ parser_1 -> parser_2 -> parser_3 ] -> output_data
| | |
v v v
+----------------------------------+
| List of found URLs |
+----------------------------------+
If parser_1 modified the data, parser_2 will receive the modified version, and so on. Parser chaining works with any type of parser, thus you can freely use either javascript or C parsers anywhere in the chain.
To set a filetype to use a parser chain, you should provide it with
a comma-separated list of parsers instead of just a single name of a
parser. As an example, if you have created your own javascript parser
that uses E4X to extract some metadata about a web page, you should
most likely want to send the data (HTML) to the xmlconv
builtin parser before your javascript parser receives it, to
avoid XML errors caused by the HTML.
filetype["your_filetype"]
{
parser = "xmlconv, yourfile.js/yourparser";
}
Furthermore, if your parser does not extract URLs but only extracts meta-information about the page, you can send it to the default HTML parser afterwards, which will extract all URLs for you:
filetype["your_filetype"]
{
parser = "xmlconv, yourfile.js/yourparser, html";
}
The following table lists the built-in parsers bundled with this version of libmetha.
html<style> tag is encountered, the css
parser is invoked on that piece. The same applies to plain text in the HTML source,
which is forwarded to the text parser.
texttext parser extracts URLs by searching for strings identifying
the start of a URL, such as http://.
csscss parser is primarily intended for when image files are to be
extracted or matched against a filetype. The css parser will
parse CSS declarations, searching for anything that might be URL. @import
is supported.
xmlconvutf8convContent-Type HTTP
header field, or a <meta> HTML element, and then converts the full text to
UTF-8. Should be used when connected to a Methanol system, and also before
parsing data with Javascript parsers. Note that running utf8conv will not
slow down the processing if the data is already UTF-8, so it's safe to always
call utf8conv.
entityconvutf8conv,
and then entityconv.
ftpMethabot can be scripted in Javascript with E4X. Scripting is useful for creating custom parsers and extracting data from complex XML.
Additionally to the standard Javascript functions, libmetha
provides the following functions that you can use:
print (str)println (str)getc ()exec (command)command in the default shell. Returns false on error. Returns
the full output (i.e. the data written to stdout by the child process)
as a string.
this objectthis is an object available in all javascript functions called
by a worker in libmetha. this is almost like a direct connection
to the actual worker. From this, you can get information such as HTTP
status codes and the current URL. A complete table of what information you
can get from it is displayed below.
status_codecontent_typethis.data.
urlthis.data.
dataset_attribute (name, value)name attribute to value for the current
URL.
E4X is a very easy-to-use but powerful and flexible extension to Javasceript. This tutorial
will help you get started with writing parsers for libmeth clients using
Javascript and E4X. You should have some previous experience with Javascript or
another procedural programming/scripting language before attempting this tutorial.
You should not attempt to understand this tutorial unless you are already familiar with configuration files, as otherwise you may have a hard time getting your script loaded into the client.
This tutorial will use mb for testing, but the same code and configuration
files can be used for mb-client or any client linked with libmetha.
A loadable parser should be a normal javascript function expecting zero arguments. All necessary information you will need will be available in an object called this. Your parser should return an array of all URLs it finds in the data, or null if no URL was found. The following snippet demonstrates a skeleton parser that always returns the same URL:
function test_parser()
{
println("It works!");
return Array("http://google.com/");
}
While it is possible to return a string instead of an Array, it is good practice to always return an Array, even in cases when only one URL is to be returned.
See Functions for more information about println.
To test the above parser, first save it to a file, let us say example.js.
Save example.js to the default script path. You can get the default script
path by invoking mb --info, by default mb has its default script
path set to /usr/share/metha/scripts, and also a user-specific script path
at ~/.methabot/scripts/. For a normal installation, save example.js,
to ~/.methabot/scripts/. Create a simple configuration file that
sets your parser as the default:
include "default.conf"
extend: filetype["html"]
{
parser = "example.js/test_parser";
}
Notice that when you set the parser option of the filetype, you
must include the name of the javascript file and the name of the function, separated
using a slash.
Test your parser by running mb with your configuration file. If it "does not work",
then try adding the -e flag to mb so Methabot won't discard google.com because
it is an external URL.
Now let us get back to this. this is an object directly connected
to a worker in libmetha. A worker is a thread, so each thread will have its
own this object when running javascript code. The this object contains
information such as the current URL and the current data downloaded from that URL.
For a full explanation of the this object, see this object.
Since most websites are coded in HTML, we will have to convert them to XML. Otherwise,
we can't process them using E4X. libmetha comes with a default parser named
xmlconv for converting HTML/XHTML to XML. To convert the data to XML before it
reaches your parser, you will have to set up a parser chain:
include "default.conf"
extend: filetype["html"]
{
parser = "xmlconv, example.js/test_parser";
}
The order of the two parsers is important. Now that the data is converted to XML,
we'll modify test_parser a little:
function test_parser()
{
var xml = new XML(this.data);
return xml..a.@href;
}
this.data is the data downloaded from the URL in question. xmlconv
will modify this.data before it is sent to your test_parser.
An XML object is created from this.data. Elements in the XML can be accessed
in a query-like manner – have a look at the return statement. After xml we have
two dots, think of them as an expression for "any element", or more specifically,
"any depth" in the XML tree. Next we have the a, and so far we thus have
something like "look for an <a> element in any depth of the XML tree". Finally, @href
returns the href attribute of the element. Since there can be multiple
matches, i.e. many <a> elements, an XMLList is returned. An XMLList is almost like
an array.
An important point to note once we've come this far is that you do not have to
worry about the URL syntax of what you return. If you return "blah" and the current
URL is http://www.metha-sys.org/, then libmetha will read your
return value as http://www.metha-sys.org/blah/. It will also remove anchors (i.e. "...#blah") and
normalize the URL in various ways. You don't have to do any sorting either, just return
all the URLs you find and libmetha will sort them into filetypes.
Back to the E4X, you can concat multiple XMLLists with the '+' operator:
function test_parser()
{
var x = new XML(this.data);
return x..a.@href + x..img.@src;
}
Now that we have talked about how to locate elements and how to access their attributes in E4X, let's move on to the next step; finding specific data. Sometimes it is prefered to extract one or several specific URLs from a specific part of the document. Consider the following HTML code:
<html>
<head>
<title>Example</title>
</head>
<body>
<div id="header">
<h1>Example Website</h1>
<a href="uninteresting">Uninteresting</a>
</div>
<div id="content>
<h2>Interesting URLs</h2>
<a href="test1">test1</a>
<a href="test2">test2</a>
</div>
</body>
</html>
Say we want to extract the two URLs in the div with ID content, we can use an E4X expression like this one:
function test_parser()
{
var x = new XML(this.data);
println("Title of current URL: " + x..title); /* just for fun */
return x..div.(@id == "content").a.@href;
}
Don't forget to add xmlconv to your parser chain any time you want to
parse HTML as XML.
Each crawler configuration can have its own init function. The init function will run before any crawling is started. The purpose of the init function is to interpret initial input arguments and convert them to URLs. To visualize what's meant with input arguments, consider the following Methabot command:
$ mb -p 2 -d google.com test.com example.com
In this case, the input arguments are three; google.com, test.com and example.com. Normally, these arguments are URLs. However, an init function interpret the input arguments in any way and build its own URLs from them.
By default, a default init function will be used. This default init function will convert the arguments you give it to full URLs. In the case above it would have been a matter of adding http:// in front of the all three only. This section will explain how to override the default init function and roll your own.
To demonstrate the init function's usefulness, we will write a simple init function that uses the arguments sent to it to build google search string.
The first thing you should know is that unlike parser functions, init functions expect an argument. This argument will be the array of input data, one element per argument on command line. Thus, you will have to loop through the array to "build" your URLs for each argument. Here is a simple example:
function init_example(args)
{
var ret = Array();
for (var x=0; x<args.length; x++)
ret.push("http://www.google.com/search?q="+args[x]);
return ret;
}
To test an init function, the init crawler option is used:
crawler["..."]
{
...
init = "js_file.js/init_example";
...
}
This chapter will talk about hook scripts. Hook scripts are used in a Methanol system for customization.
The scripts must be put on the master server. The master server will then distribute the scripts to the slave servers. Finally, the slave server(s) will execute the scripts.
Depending on what hook you are writing a script for, the script will be
executed when different events occur in the system. For example, the cleanup
hook is executed before a slave exits, and the session_complete hook is
executed when a client has uploaded information about a web page.
A system hook can be scripted in any scripting language, as long as the language is supported by the slave server. This manual will however concentrate on hooks scripted in PHP.
To set up a hook, set the hook's corresponding mn-masterd option to
the full path of the script file. The file can be placed anywhere on the system. Example:
master {
...
session_complete_hook = "/home/example/example.php";
...
}
The file itself, in this case example.php, must always begin with a line telling
the operating system which interpreter to use. This is the line you see in any script file
on your system. For example, a PHP file might begin with:
#!/usr/bin/php
... script ...
A bash file might begin with:
#!/bin/bash
... script ...
But the full path to the interpreter depends on where it is installed on the system.
Please note that the scripts will be run on slave servers, and not on the master server, so even though you put the script on the master, you must make sure the path to the interpreter applies on the slave server(s).
mn-slavedmn-slaved will save the scripts received from the master to one file per script. To
do this, the slave must have write permissions to a directory in the system.
First we create a directory where mn-slaved will be working. You can choose to create
it anywhere in the system, as long as that partition is not mounted with noexec, but
in this manual we will create it in /var/mn-slaved.
$ mkdir /var/mn-slaved
$ useradd mn-slaved -s /sbin/nologin -d /dev/null
$ chown mn-slaved:mn-slaved /var/mn-slaved
Now we can set up mn-slaved to use our new system user when forking. Change the user and
group options of in mn-slaved.conf to "mn-slaved" both.
The exec_dir option must also be set, set it to the directory you created. To summarize:
slave {
...
user = "mn-slaved";
group = "mn-slaved";
exec_dir = "/var/mn-slaved";
}
This section will describe each hook, how it will be invoked and when it will be invoked.
cleanupThe cleanup hook will be invoked before the slave exits. This hook can be useful
for cleaning up custom data added to the database, which is no longer needed after
the slave has disconnected.
See the cleanup_hook mn-masterd option.
session_completeThe session_complete hook will be invoked once a session completes.
When a slave sends a URL to the client, a session is started.
This session will last until the client is out of URLs again. Note that
the definition of a session is dependant on how the clients' crawlers are configured,
such as depth limit.
See the session_complete_hook mn-masterd option.
lmm_mysql – Javascript-MySQL Bindingslmm_hash – Functions for calculating checksumslmm_file – C-like file and directory handlingThis manual is for Methanol Web Crawling System, version 1.7.0.
Copyright © 2008, 2009 Emil Romanus <sdac@bithack.se>
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Web crawlers such as mb and mb-client
can download a file named robots.txt from the root directory of a web server. This file
will contain rules as to which pages or parts of the website the bot may access and not. This
system is known as the robots exclusion standard.
Enabling support for robots.txt is recommended when doing heavy crawling.
You can enable robots.txt from command line using:
$ mb --robotstxt
You can also enable it per-crawler from a configuration file:
crawler["example"]
{
robotstxt = true;
}
libmetha has wide support for the robots exclusion standard.
The following table lists which directives that are supported.
User-agent User-agent: Googlebot
Disallow: /
User-agent: Methabot
Allow: /
DisallowDisallow: /private-directory/
Allow Disallow: /private/
Allow: /private/public.html
Note that this is a non-standard extensions, though it is supported by most web crawlers.