public class NetworkCrawler extends AbstractListCrawler<URL>
This class handles a list of URLs pointing to data files or zip/jar archives on the net.
Since the net is not a tree structure, the list elements cannot be top-level elements that are browsed recursively as in DirectoryCrawler; they must be data files or zip/jar archives.
The files fetched from the network can be cached locally on disk. This avoids overly frequent network access when the URLs are remote ones (for example original internet URLs).
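How the local cache is set up is not described here. As an illustration of the general idea only, the following minimal sketch fetches a URL once and reuses the on-disk copy afterwards. The class UrlDiskCache, its fetch method and the file naming scheme are hypothetical and not part of the library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not part of the library): keeps fetched files on disk. */
final class UrlDiskCache {

    /** Directory holding the cached copies. */
    private final Path cacheDir;

    UrlDiskCache(final Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    /** Return the local copy of the URL content, fetching it only if it is not cached yet. */
    Path fetch(final URL url) throws IOException {
        // derive a flat file name from the URL; name collisions are ignored in this sketch
        final String name = url.toExternalForm().replaceAll("[^A-Za-z0-9._-]", "_");
        final Path cached = cacheDir.resolve(name);
        if (!Files.exists(cached)) {
            Files.createDirectories(cacheDir);
            try (InputStream in = url.openStream()) {
                Files.copy(in, cached, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return cached;
    }
}
```

The path returned by fetch can be converted back with toUri().toURL() if a URL is needed. This sketch never expires cached files, so a real setup would also need an invalidation policy.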
If the URL points to a remote server (typically on the web) on the other side of a proxy server, you need to configure the networking layer of your application to use the proxy. For a typical authenticating proxy as used in many corporate environments, this can be done as follows using for example the AuthenticatorDialog graphical authenticator class that can be found in the tests directories:
    System.setProperty("http.proxyHost", "proxy.your.domain.com");
    System.setProperty("http.proxyPort", "8080");
    System.setProperty("http.nonProxyHosts", "localhost|*.your.domain.com");
    Authenticator.setDefault(new AuthenticatorDialog());
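AuthenticatorDialog is a graphical class, which may not suit a headless application. In that case a non-interactive java.net.Authenticator could be installed instead. The following is a minimal sketch under that assumption; FixedCredentialsAuthenticator is a hypothetical class, not part of the library, and where the credentials come from is left to the caller.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

/** Hypothetical non-graphical authenticator; the credentials are supplied by the caller. */
final class FixedCredentialsAuthenticator extends Authenticator {

    private final String user;
    private final char[] password;

    FixedCredentialsAuthenticator(final String user, final char[] password) {
        this.user = user;
        this.password = password.clone();
    }

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // only answer challenges coming from a proxy, never from origin servers
        if (getRequestorType() == RequestorType.PROXY) {
            return new PasswordAuthentication(user, password);
        }
        return null;
    }
}
```

It would be installed with Authenticator.setDefault(new FixedCredentialsAuthenticator(user, password)) in place of the AuthenticatorDialog call. Restricting the answer to proxy challenges is a design choice of this sketch, made so that the proxy credentials are not offered to other servers.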
All registered filters are applied.
Zip archive entries are supported recursively.
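To illustrate what recursive support for zip entries means, here is a minimal sketch that lists the files of an archive and descends into entries that are themselves archives. It uses only java.util.zip and is not the library's implementation; the class NestedZipLister and the "!/" separator are hypothetical, and the suffix test is a simplified stand-in for a pattern such as ZIP_ARCHIVE_PATTERN.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper (not part of the library): lists files in possibly nested archives. */
final class NestedZipLister {

    private NestedZipLister() {
        // utility class
    }

    /** List all non-archive entries, descending into nested zip/jar entries. */
    static List<String> list(final InputStream in) throws IOException {
        final List<String> names = new ArrayList<>();
        collect(new ZipInputStream(in), "", names);
        return names;
    }

    private static void collect(final ZipInputStream zip, final String prefix,
                                final List<String> names) throws IOException {
        for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
            if (entry.isDirectory()) {
                continue;
            }
            final String fullName = prefix + entry.getName();
            if (fullName.endsWith(".zip") || fullName.endsWith(".jar")) {
                // the current entry is itself an archive: read its bytes as a nested zip stream
                // (the nested stream is deliberately not closed, as that would close the outer one)
                collect(new ZipInputStream(zip), fullName + "!/", names);
            } else {
                names.add(fullName);
            }
        }
    }
}
```

For an archive containing a.txt and inner.zip, where inner.zip contains b.txt, list returns a.txt and inner.zip!/b.txt.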
This is a simple application of the visitor design pattern for list browsing.
See Also: DataProvidersManager, ZIP_ARCHIVE_PATTERN
Constructor and Description |
---|
NetworkCrawler(URL... inputs): Build a network crawler. |
Modifier and Type | Method and Description |
---|---|
protected String | getBaseName(URL input): Get the base name of an input. |
protected String | getCompleteName(URL input): Get the complete name of an input. |
protected InputStream | getStream(URL input): Get the stream to read from an input. |
protected ZipJarCrawler | getZipJarCrawler(URL input): Get a zip/jar crawler for an input. |
void | setTimeout(int timeout): Set the timeout for connection. |
Methods inherited from class AbstractListCrawler: addInput, feed, getInputs
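The value passed to setTimeout is a connection timeout in milliseconds. How it is applied internally is not stated on this page. The following minimal sketch assumes it is handed to the standard java.net.URLConnection timeouts before the stream is opened; TimedUrlOpener is a hypothetical helper, and also applying the value as a read timeout is an assumption of the sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical helper (not part of the library): opens a URL with a timeout. */
final class TimedUrlOpener {

    private TimedUrlOpener() {
        // utility class
    }

    /** Open a stream on the URL, using the same timeout (in milliseconds) for connecting and reading. */
    static InputStream open(final URL url, final int timeoutMillis) throws IOException {
        final URLConnection connection = url.openConnection();
        connection.setConnectTimeout(timeoutMillis);
        // assumption of this sketch: the read timeout follows the connection timeout
        connection.setReadTimeout(timeoutMillis);
        return connection.getInputStream();
    }
}
```

With java.net.URLConnection, a timeout of zero means an infinite timeout and a negative value is rejected with an IllegalArgumentException.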
public NetworkCrawler(URL... inputs)
The default timeout is set to 10 seconds.
inputs - list of input file URLs

public void setTimeout(int timeout)
timeout - connection timeout in milliseconds

protected String getCompleteName(URL input)
getCompleteName in class AbstractListCrawler<URL>
input - input to consider

protected String getBaseName(URL input)
getBaseName in class AbstractListCrawler<URL>
input - input to consider

protected ZipJarCrawler getZipJarCrawler(URL input)
getZipJarCrawler in class AbstractListCrawler<URL>
input - input to consider

protected InputStream getStream(URL input) throws IOException
getStream in class AbstractListCrawler<URL>
input - input to read from
IOException - if the input cannot be opened for reading

Copyright © 2002-2023 CS GROUP. All rights reserved.