Class NetworkCrawler

  • All Implemented Interfaces:
    DataProvider

    public class NetworkCrawler
    extends AbstractListCrawler<URL>
    Provider for data files fetched directly from the network.

    This class handles a list of URLs pointing to data files or zip/jar archives on the network. Since the network is not a tree structure, the list elements cannot be top-level elements that are browsed recursively as in DirectoryCrawler; they must be data files or zip/jar archives.

    The files fetched from the network can be cached locally on disk. This avoids overly frequent network access when the URLs are remote ones (for example, original internet URLs).

    If the URL points to a remote server (typically on the web) on the other side of a proxy server, you need to configure the networking layer of your application to use the proxy. For a typical authenticating proxy as used in many corporate environments, this can be done as follows, using for example the AuthenticatorDialog graphical authenticator class found in the tests directories:

       System.setProperty("http.proxyHost",     "proxy.your.domain.com");
       System.setProperty("http.proxyPort",     "8080");
       System.setProperty("http.nonProxyHosts", "localhost|*.your.domain.com");
       Authenticator.setDefault(new AuthenticatorDialog());
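
When no graphical environment is available (for example on a server), the AuthenticatorDialog used above could be swapped for a non-interactive authenticator. The following is only a minimal sketch built on the standard java.net classes; the class name FixedCredentialsAuthenticator and the way the credentials are supplied are hypothetical and not part of the library.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

/** Hypothetical non-graphical authenticator that answers proxy challenges only. */
final class FixedCredentialsAuthenticator extends Authenticator {

    private final String user;
    private final char[] password;

    FixedCredentialsAuthenticator(String user, char[] password) {
        this.user = user;
        // Defensive copy so the caller can wipe its own array afterwards
        this.password = password.clone();
    }

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // Never hand the proxy credentials to an origin server
        if (getRequestorType() != RequestorType.PROXY) {
            return null;
        }
        return new PasswordAuthentication(user, password);
    }
}
```

It would be installed in the same way as the graphical one, with Authenticator.setDefault(new FixedCredentialsAuthenticator(user, password)).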
     

    All registered filters are applied.

    Zip archive entries are supported recursively.
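
To illustrate what recursive support for zip entries can mean, here is a minimal sketch that walks an archive with java.util.zip and descends into entries that are themselves zip files. It is not the library's actual code: the class NestedZipWalker, its methods, the "!/" name separator and the entry names are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper listing data entries of a zip, including nested zips. */
final class NestedZipWalker {

    /** Collect the names of all non-archive entries, descending into nested zip entries. */
    static void collect(InputStream in, String prefix, List<String> names) throws IOException {
        // The caller owns `in`: the wrapper is deliberately not closed here,
        // since closing it would also close the enclosing archive stream.
        ZipInputStream zip = new ZipInputStream(in);
        for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
            if (entry.isDirectory()) {
                continue;
            }
            String fullName = prefix + entry.getName();
            if (entry.getName().endsWith(".zip")) {
                // The current entry is itself an archive: read it as a nested zip stream
                collect(zip, fullName + "!/", names);
            } else {
                names.add(fullName);
            }
        }
    }

    /** Build an in-memory zip archive (demo helper only). */
    static byte[] zipOf(Map<String, byte[]> entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                out.putNextEntry(new ZipEntry(e.getKey()));
                out.write(e.getValue());
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> inner = new LinkedHashMap<>();
        inner.put("data.txt", new byte[] {1, 2, 3});
        Map<String, byte[]> outer = new LinkedHashMap<>();
        outer.put("README.txt", new byte[] {4});
        outer.put("nested.zip", zipOf(inner));

        List<String> names = new ArrayList<>();
        collect(new ByteArrayInputStream(zipOf(outer)), "", names);
        System.out.println(names);
    }
}
```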

    This is a simple application of the visitor design pattern for list browsing.

    Author:
    Luc Maisonobe
    See Also:
    DataProvidersManager
    • Constructor Detail

      • NetworkCrawler

        public NetworkCrawler(URL... inputs)
        Build a network crawler.

        The default timeout is set to 10 seconds.

        Parameters:
        inputs - list of input file URLs
    • Method Detail

      • setTimeout

        public void setTimeout(int timeout)
        Set the connection timeout.
        Parameters:
        timeout - connection timeout in milliseconds
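
As an illustration of what a connection timeout in milliseconds means at the java.net level, the sketch below applies one to a URLConnection before it is opened. Whether the crawler forwards its setting in exactly this way is an assumption; TimeoutSketch and prepare are hypothetical names.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical illustration of a connect timeout expressed in milliseconds. */
final class TimeoutSketch {

    /** Prepare (but do not open) a connection with the given connect timeout. */
    static URLConnection prepare(URL url, int timeoutMillis) throws IOException {
        // openConnection() only creates the connection object; nothing is sent yet
        URLConnection connection = url.openConnection();
        // A value of 0 would mean "wait indefinitely"
        connection.setConnectTimeout(timeoutMillis);
        return connection;
    }
}
```

With this unit, the 10-second default mentioned for the constructor corresponds to a value of 10000.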
      • getCompleteName

        protected String getCompleteName(URL input)
        Get the complete name of an input.
        Specified by:
        getCompleteName in class AbstractListCrawler<URL>
        Parameters:
        input - input to consider
        Returns:
        complete name of the input
      • getBaseName

        protected String getBaseName(URL input)
        Get the base name of an input.
        Specified by:
        getBaseName in class AbstractListCrawler<URL>
        Parameters:
        input - input to consider
        Returns:
        base name of the input
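
The descriptions above do not define the two names precisely. One plausible reading, shown here purely as a sketch, is that the complete name is the full text of the URL and the base name is the last segment of its path. The class UrlNames and both method bodies are assumptions, not the library's implementation.

```java
import java.net.URL;

/** Hypothetical reading of "complete name" and "base name" for a URL. */
final class UrlNames {

    /** Complete name: the full textual form of the URL. */
    static String completeName(URL input) {
        return input.toExternalForm();
    }

    /** Base name: last segment of the URL path (query and fragment excluded). */
    static String baseName(URL input) {
        String path = input.getPath();
        // lastIndexOf returns -1 when there is no '/', so the whole path is kept
        return path.substring(path.lastIndexOf('/') + 1);
    }
}
```

Under this reading, a URL such as https://example.org/data/archive.zip would have archive.zip as its base name and the whole URL string as its complete name.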