-
- Index-html Plugin for apache nutch 2.xIndex HTML content of the pages in Apaache Nutch 2.x ( https://github.com/Meabed/nutch2-index-html.git )
--
- - Instruction:-
-
-
-
-
- - Compile from Source-
-
- Download the plugin folder "index-html" and copy it to you Apache nutch 2 plugin directory ( ex: apache-nutch-2.x.x/src/plugin ) -
- Add the ( index-html ) plugin to The plugin folder build.xml ( apache-nutch-2.x.x/src/plugin/build.xml ) in target ( deploy and clean ) so the file will look like -
- -<target name="deploy"> - ....... - <ant dir="index-basic" target="deploy"/> - <ant dir="index-more" target="deploy"/> - <ant dir="index-html" target="deploy"/> - <ant dir="language-identifier" target="deploy"/> - ......... -</target> - -<target name="clean"> - ....... - <ant dir="index-basic" target="deploy"/> - <ant dir="index-more" target="deploy"/> - <ant dir="index-html" target="deploy"/> - <ant dir="language-identifier" target="deploy"/> - ......... -</target> -
--
-
- Run ( ant runtime ) in apache nutch 2 root folder to start the build -
- You should have index-html.jar in build folder -
- Enable the plugin by adding it to nutch-sites.xml ( or nutch-default.xml ) like beloe : -
- -<configuration> - .......... - <property> - <name>plugin.includes</name> - <value>...........someplugins....|index-html</value> - </property> - .......... - </configuration>
--
-
- The plugin will add new Field "rawcontent" to the Nutch Doc, To index this field you need to add it to ( scheme.xml or schema-solr4.xml ) like -
-- -<field name="rawcontent" type="text" sstored="true" indexed="true" multiValued="false"/>
--
-
- Run the crawler and you should see the new field rawcontent in index! -
- -
-
-
- - Use Pre-Compiled Library-
-
- In The repo there is Build folder contain compiled .jar library ready for use. -
- Copy the library to your runtime path local if you are running the plugin locally ( apache-nutch-2.x.x/runtime/local/plugins ) Then follow the above steps to configure nutch-sites.xml -
-