Merge pull request #2 from base-cms/do-not-extract
Tweak heading adjustments and do not extract
zarathustra323 authored Apr 10, 2019
2 parents 2e199eb + 017c3c4 commit 661df22
Showing 5 changed files with 162 additions and 181 deletions.
43 changes: 3 additions & 40 deletions src/rules/pennwell/README.md
@@ -2,45 +2,8 @@

## Default
To access, send a `POST` request to `/pennwell/default`. This rule set performs the following operations:
- Removes duplicative whitespace values (via `html.replace(/\s\s+/g, '')`)
- Extracts the `deck` text from elements classed with `.paraStyle_headline_deck` and removes the element from the cleaned HTML.
- Extracts an `author` object from elements classed with `.paraStyle_byline` or `.paraStyle_body_bio` and removes the elements from the cleaned HTML.
- If an `<h1>` is detected anywhere in the body, all heading elements are increased by one (e.g. `<h1>` becomes `<h2>`, `<h2>` becomes `<h3>`, etc).
- Removes duplicative whitespace values
- If an `<h1>` or `<h2>` is detected anywhere in the body, all heading elements are increased by two (e.g. `<h1>` becomes `<h3>`, `<h2>` becomes `<h4>`, etc).
- Removes all `<form>` and `<style>` elements.
- Removes all `id`, `class`, `style` and `data-*` attributes from elements.
- Removes PennNet.com iframe embeds, e.g. where `iframe[src*="pennnet.com"]`.

### Examples

#### Request
```html
<h4 class="paraStyle_headline_deck">Put Drivers in Safe Hands with Telematics</h4>
<h2 class="paraStyle_byline">By Jenny Shiner</h2>

<p>Test</p>

<p class="paraStyle_body">At the end of the day, when considering the objectives of a telematics implementation, no reasoning is quite as important as increasing safety for employees and the general public on the roadways. Using telematics as part of a fleetwide safety initiative will drive the program miles forward while providing the business with several other impactful benefits. UP </p>
<p class="paraStyle_body_bio"><strong class="charStyle_bold">The Author: </strong></p>

<p class="paraStyle_body_bio"><img src="//aemstatic-ww2.azureedge.net/content/dam/up/print-articles/volume-23/issue-2/1902UPpf2-a01.jpg" alt="" width="167" height="167"></p>

<p class="paraStyle_body_bio">Jenny Shiner is the communications manager for GPS Insight. She graduated from Arizona State University with a bachelor’s degree in communication and is responsible for communication for all business segments that GPS Insight targets. For more information on telematics and fuel card technologies, visit www.gpsinsight.com.</p>
```

#### Response
```json
{
"extracted": {
"deck": "Put Drivers in Safe Hands with Telematics",
"author": {
"name": "Jenny Shiner",
"image": "//aemstatic-ww2.azureedge.net/content/dam/up/print-articles/volume-23/issue-2/1902UPpf2-a01.jpg",
"bio": "<p><strong class=\"charStyle_bold\">The Author: </strong></p><p>Jenny Shiner is the communications manager for GPS Insight. She graduated from Arizona State University with a bachelor&#x2019;s degree in communication and is responsible for communication for all business segments that GPS Insight targets. For more information on telematics and fuel card technologies, visit www.gpsinsight.com.</p>"
}
},
"html": {
"cleaned": "<p>Test</p><p class=\"paraStyle_body\">At the end of the day, when considering the objectives of a telematics implementation, no reasoning is quite as important as increasing safety for employees and the general public on the roadways. Using telematics as part of a fleetwide safety initiative will drive the program miles forward while providing the business with several other impactful benefits. UP </p>",
"original": "<h4 class=\"paraStyle_headline_deck\">Put Drivers in Safe Hands with Telematics</h4>\n\t\t\t<h2 class=\"paraStyle_byline\">By Jenny Shiner</h2>\n\n<p>Test</p>\n\n<p class=\"paraStyle_body\">At the end of the day, when considering the objectives of a telematics implementation, no reasoning is quite as important as increasing safety for employees and the general public on the roadways. Using telematics as part of a fleetwide safety initiative will drive the program miles forward while providing the business with several other impactful benefits. UP </p>\n\t\t\t<p class=\"paraStyle_body_bio\"><strong class=\"charStyle_bold\">The Author: </strong></p>\n\n<p class=\"paraStyle_body_bio\"><img src=\"//aemstatic-ww2.azureedge.net/content/dam/up/print-articles/volume-23/issue-2/1902UPpf2-a01.jpg\" alt=\"\" width=\"167\" height=\"167\"></p>\n\n<p class=\"paraStyle_body_bio\">Jenny Shiner is the communications manager for GPS Insight. She graduated from Arizona State University with a bachelor’s degree in communication and is responsible for communication for all business segments that GPS Insight targets. For more information on telematics and fuel card technologies, visit www.gpsinsight.com.</p>"
}
}
```
- Removes PennNet.com iframe embeds, e.g. where `iframe[src*="pennnet.com"]`.
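
As an illustration of the whitespace bullet in the list above, here is a minimal sketch. It assumes the rule still applies the regex quoted in the earlier README wording (`/\s\s+/g` replaced with an empty string); the implementation of `stripWhitespace` is not shown in this diff, and `collapseWhitespace` is a hypothetical name used only here.

```javascript
// Sketch only: `collapseWhitespace` is a hypothetical stand-in, not the project's helper.
// It applies the regex quoted in the earlier README wording.
const collapseWhitespace = html => html.replace(/\s\s+/g, '');

// Runs of two or more whitespace characters are removed entirely;
// the single space inside "Foo bar" is left alone.
const input = '<div>\n  <p>Foo bar</p>\n\n</div>';
const output = collapseWhitespace(input);

console.log(output);
```

Under this assumption a lone newline between two tags would survive, because the pattern only matches runs of at least two whitespace characters.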
97 changes: 47 additions & 50 deletions src/rules/pennwell/default.js
@@ -14,57 +14,57 @@ const removeAttrs = ($) => {

const loadHTML = html => cheerio.load(html, { decodeEntities: false });

const cleanTextValue = v => (v || '').replace(/\s+/g, ' ').trim();

const extractDeck = ($) => {
const className = '.paraStyle_headline_deck';
const element = $(className);
if (!element.length) return null;
const deck = cleanTextValue(element.text()) || null;
element.replaceWith('');
return deck;
};

const cleanBio = (bio) => {
if (!bio) return null;
const $ = loadHTML(bio);
removeAttrs($);
return $('body').html();
};

const extractAuthor = ($) => {
const bylineClass = '.paraStyle_byline';
const bioClass = '.paraStyle_body_bio';

const name = cleanTextValue($(bylineClass).text()).replace(/^by/i, '').trim();

let image = null;
let bio = '';

$(bioClass).each(function () {
const imgElement = $(this).children('img');
if (imgElement.length) {
image = imgElement.attr('src');
} else {
bio = `${bio}<p>${$(this).html()}</p>`;
}
});

$(bylineClass).replaceWith('');
$(bioClass).replaceWith('');
return {
name: name || null,
image: image || null,
bio: cleanBio(bio),
};
};
// const cleanTextValue = v => (v || '').replace(/\s+/g, ' ').trim();

// const extractDeck = ($) => {
// const className = '.paraStyle_headline_deck';
// const element = $(className);
// if (!element.length) return null;
// const deck = cleanTextValue(element.text()) || null;
// element.replaceWith('');
// return deck;
// };

// const cleanBio = (bio) => {
// if (!bio) return null;
// const $ = loadHTML(bio);
// removeAttrs($);
// return $('body').html();
// };

// const extractAuthor = ($) => {
// const bylineClass = '.paraStyle_byline';
// const bioClass = '.paraStyle_body_bio';

// const name = cleanTextValue($(bylineClass).text()).replace(/^by/i, '').trim();

// let image = null;
// let bio = '';

// $(bioClass).each(function () {
// const imgElement = $(this).children('img');
// if (imgElement.length) {
// image = imgElement.attr('src');
// } else {
// bio = `${bio}<p>${$(this).html()}</p>`;
// }
// });

// $(bylineClass).replaceWith('');
// $(bioClass).replaceWith('');
// return {
// name: name || null,
// image: image || null,
// bio: cleanBio(bio),
// };
// };

module.exports = async (body) => {
const html = stripWhitespace(body);
const $ = loadHTML(html);

const deck = extractDeck($);
const author = extractAuthor($);
// const deck = extractDeck($);
// const author = extractAuthor($);

adjustHeadings($);

@@ -78,10 +78,7 @@ module.exports = async (body) => {
removeAttrs($);

return {
extracted: {
deck,
author,
},
extracted: {},
html: {
cleaned: $('body').html(),
original: body,
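
To make the effect of this hunk on callers concrete, the sketch below restates the returned object as a plain function. `buildResult` and `readAuthorName` are hypothetical names introduced for illustration; only the object shape (`extracted`, `html.cleaned`, `html.original`) comes from the diff.

```javascript
// Hypothetical helper mirroring the object literal returned in the hunk above.
const buildResult = (original, cleaned) => ({
  extracted: {}, // deck and author are no longer populated
  html: { cleaned, original },
});

// Hypothetical consumer: guards against the now-missing `author` key
// instead of reading `extracted.author.name` directly.
const readAuthorName = (result) => {
  const { author } = result.extracted;
  return author && author.name ? author.name : null;
};

const result = buildResult('<h1 class="x">Foo</h1>', '<h3>Foo</h3>');
console.log(Object.keys(result.extracted).length);
console.log(readAuthorName(result));
```

A consumer that previously read `result.extracted.author.name` without a guard would now throw, since `author` is undefined, which is why the sketch checks for the key first.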
5 changes: 3 additions & 2 deletions src/utils/adjust-headings.js
@@ -3,11 +3,12 @@ const cheerio = require('cheerio');
const selector = 'h1, h2, h3, h4, h5';

module.exports = ($) => {
if ($('h1').length) {
if ($('h1').length || $('h2').length) {
$(selector).each(function () {
const tag = $(this).prop('tagName').toLowerCase();
const [, num] = [...tag];
const newTag = `h${Number(num) + 1}`;
const n = Number(num);
const newTag = `h${n < 5 ? n + 2 : n + 1}`;
const { attribs } = $(this)[0];

const $new = cheerio.load(`<span><${newTag}>${$(this).html()}</${newTag}></span>`)('span');
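
The new level arithmetic can be checked in isolation. The sketch below restates the ternary from the hunk above as a pure function; `shiftLevel` is a hypothetical name, and the sketch deliberately leaves out the cheerio element replacement and the guard that only runs the adjustment when an `<h1>` or `<h2>` is present.

```javascript
// Hypothetical pure function restating the ternary in the hunk above.
const shiftLevel = n => (n < 5 ? n + 2 : n + 1);

// The selector in the diff covers h1 through h5 only, so h6 is never visited.
const mapping = [1, 2, 3, 4, 5].map(n => `h${n} -> h${shiftLevel(n)}`);

console.log(mapping.join(', '));
// h1 -> h3, h2 -> h4, h3 -> h5, h4 -> h6, h5 -> h6
```

Both `<h4>` and `<h5>` end up as `<h6>`, so the shift is by two only where the result still fits within the six heading levels. This matches the updated expectation in the spec file, where the deeper headings collapse onto `<h6>`.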
163 changes: 81 additions & 82 deletions test/rules/pennwell/default.spec.js
@@ -1,3 +1,5 @@
/* eslint-disable max-len */

const rule = require('../../../src/rules/pennwell/default');

describe('rules/pennwell/default', () => {
@@ -88,14 +90,11 @@ describe('rules/pennwell/default', () => {
</div>
`;
const result = await rule(body);
expect(result.html.cleaned).to.equal('<div><h2>Foo</h2><h3>Bar</h3><h3>Bar</h3><div><h4>Foo</h4><h4>Foo</h4><h5>Foo</h5><h6>Foo</h6><h6>Foo</h6></div></div>');
expect(result.html.cleaned).to.equal('<div><h3>Foo</h3><h4>Bar</h4><h4>Bar</h4><div><h5>Foo</h5><h5>Foo</h5><h6>Foo</h6><h6>Foo</h6><h6>Foo</h6></div></div>');
});
it('should not adjust heading elements when an <h1> is not present.', async () => {
it('should not adjust heading elements when an <h1> or <h2> is not present.', async () => {
const body = `
<div>
<h2>Foo</h2>
<h2>Bar</h2>
<h2>Bar</h2>
<div>
<h3>Foo</h3>
<h3>Foo</h3>
@@ -106,7 +105,7 @@ describe('rules/pennwell/default', () => {
</div>
`;
const result = await rule(body);
expect(result.html.cleaned).to.equal('<div><h2>Foo</h2><h2>Bar</h2><h2>Bar</h2><div><h3>Foo</h3><h3>Foo</h3><h4>Foo</h4><h5>Foo</h5><h6>Foo</h6></div></div>');
expect(result.html.cleaned).to.equal('<div><div><h3>Foo</h3><h3>Foo</h3><h4>Foo</h4><h5>Foo</h5><h6>Foo</h6></div></div>');
});
it('should remove `class` attributes.', async () => {
const body = `
@@ -144,80 +143,80 @@ describe('rules/pennwell/default', () => {
const result = await rule(body);
expect(result.html.cleaned).to.equal('<div><span>Bar</span></div>');
});
it('should extract a deck value when present.', async () => {
const body = `
<div>
<h4 class="paraStyle_headline_deck"> Put Drivers in
Safe Hands with Telematics</h4>
<p>Foo</p>
</div>
`;
const result = await rule(body);
expect(result.extracted.deck).to.equal('Put Drivers in Safe Hands with Telematics');
});
it('should return a null deck when elements are present but are empty.', async () => {
const body = `
<div>
<h4 class="paraStyle_headline_deck"></h4>
<p>Foo</p>
</div>
`;
const result = await rule(body);
expect(result.extracted.deck).to.equal(null);
});
it('should remove the deck elements when present.', async () => {
const body = `
<div>
<h4 class="paraStyle_headline_deck"> Put Drivers in
Safe Hands with Telematics</h4>
<p>Foo</p>
</div>
`;
const result = await rule(body);
expect(result.html.cleaned).to.equal('<div><p>Foo</p></div>');
});
it('should extract an author name when present.', async () => {
const body = `
<div>
<h2 class="paraStyle_byline">By Jenny
Shiner</h2>
</div>
`;
const result = await rule(body);
expect(result.extracted.author.name).to.equal('Jenny Shiner');
});
it('should extract an author image when present.', async () => {
const body = `
<div>
<p class="paraStyle_body_bio"><img src="//aemstatic-ww2.azureedge.net/content/dam/up/print-articles/volume-23/issue-2/1902UPpf2-a01.jpg" alt="" width="167" height="167"></p>
</div>
`;
const result = await rule(body);
expect(result.extracted.author.image).to.equal('//aemstatic-ww2.azureedge.net/content/dam/up/print-articles/volume-23/issue-2/1902UPpf2-a01.jpg');
});
it('should extract an author bio when present.', async () => {
const body = `
<div>
<p class="paraStyle_body_bio"><strong class="charStyle_bold">The Author: </strong></p>
<p class="paraStyle_body_bio"><img src="//aemstatic-ww2.azureedge.net/content/dam/up/print-articles/volume-23/issue-2/1902UPpf2-a01.jpg" alt="" width="167" height="167"></p>
<p class="paraStyle_body_bio">Jenny Shiner is the communications manager for GPS Insight. She graduated from Arizona State University with a bachelor’s degree in communication and is responsible for communication for all business segments that GPS Insight targets. For more information on telematics and fuel card technologies, visit www.gpsinsight.com.</p>
</div>
`;
const result = await rule(body);
expect(result.extracted.author.bio).to.equal('<p><strong>The Author:</strong></p><p>Jenny Shiner is the communications manager for GPS Insight. She graduated from Arizona State University with a bachelor’s degree in communication and is responsible for communication for all business segments that GPS Insight targets. For more information on telematics and fuel card technologies, visit www.gpsinsight.com.</p>');
});
it('should remove the author elements when present.', async () => {
const body = `
<div>
<h2 class="paraStyle_byline">By Jenny Shiner</h2>
<p>Foo</p>
<p class="paraStyle_body_bio"><strong class="charStyle_bold">The Author: </strong></p>
<p class="paraStyle_body_bio"><img src="//aemstatic-ww2.azureedge.net/content/dam/up/print-articles/volume-23/issue-2/1902UPpf2-a01.jpg" alt="" width="167" height="167"></p>
<p class="paraStyle_body_bio">Jenny Shiner is the communications manager for GPS Insight. She graduated from Arizona State University with a bachelor’s degree in communication and is responsible for communication for all business segments that GPS Insight targets. For more information on telematics and fuel card technologies, visit www.gpsinsight.com.</p>
<p>Bar</p>
</div>
`;
const result = await rule(body);
expect(result.html.cleaned).to.equal('<div><p>Foo</p><p>Bar</p></div>');
});
// it('should extract a deck value when present.', async () => {
// const body = `
// <div>
// <h4 class="paraStyle_headline_deck"> Put Drivers in
// Safe Hands with Telematics</h4>
// <p>Foo</p>
// </div>
// `;
// const result = await rule(body);
// expect(result.extracted.deck).to.equal('Put Drivers in Safe Hands with Telematics');
// });
// it('should return a null deck when elements are present but are empty.', async () => {
// const body = `
// <div>
// <h4 class="paraStyle_headline_deck"></h4>
// <p>Foo</p>
// </div>
// `;
// const result = await rule(body);
// expect(result.extracted.deck).to.equal(null);
// });
// it('should remove the deck elements when present.', async () => {
// const body = `
// <div>
// <h4 class="paraStyle_headline_deck"> Put Drivers in
// Safe Hands with Telematics</h4>
// <p>Foo</p>
// </div>
// `;
// const result = await rule(body);
// expect(result.html.cleaned).to.equal('<div><p>Foo</p></div>');
// });
// it('should extract an author name when present.', async () => {
// const body = `
// <div>
// <h2 class="paraStyle_byline">By Jenny
// Shiner</h2>
// </div>
// `;
// const result = await rule(body);
// expect(result.extracted.author.name).to.equal('Jenny Shiner');
// });
// it('should extract an author image when present.', async () => {
// const body = `
// <div>
// <p class="paraStyle_body_bio"><img src="//aemstatic-ww2.azureedge.net/content/dam/up/print-articles/volume-23/issue-2/1902UPpf2-a01.jpg" alt="" width="167" height="167"></p>
// </div>
// `;
// const result = await rule(body);
// expect(result.extracted.author.image).to.equal('//aemstatic-ww2.azureedge.net/content/dam/up/print-articles/volume-23/issue-2/1902UPpf2-a01.jpg');
// });
// it('should extract an author bio when present.', async () => {
// const body = `
// <div>
// <p class="paraStyle_body_bio"><strong class="charStyle_bold">The Author: </strong></p>
// <p class="paraStyle_body_bio"><img src="//aemstatic-ww2.azureedge.net/content/dam/up/print-articles/volume-23/issue-2/1902UPpf2-a01.jpg" alt="" width="167" height="167"></p>
// <p class="paraStyle_body_bio">Jenny Shiner is the communications manager for GPS Insight. She graduated from Arizona State University with a bachelor’s degree in communication and is responsible for communication for all business segments that GPS Insight targets. For more information on telematics and fuel card technologies, visit www.gpsinsight.com.</p>
// </div>
// `;
// const result = await rule(body);
// expect(result.extracted.author.bio).to.equal('<p><strong>The Author:</strong></p><p>Jenny Shiner is the communications manager for GPS Insight. She graduated from Arizona State University with a bachelor’s degree in communication and is responsible for communication for all business segments that GPS Insight targets. For more information on telematics and fuel card technologies, visit www.gpsinsight.com.</p>');
// });
// it('should remove the author elements when present.', async () => {
// const body = `
// <div>
// <h2 class="paraStyle_byline">By Jenny Shiner</h2>
// <p>Foo</p>
// <p class="paraStyle_body_bio"><strong class="charStyle_bold">The Author: </strong></p>
// <p class="paraStyle_body_bio"><img src="//aemstatic-ww2.azureedge.net/content/dam/up/print-articles/volume-23/issue-2/1902UPpf2-a01.jpg" alt="" width="167" height="167"></p>
// <p class="paraStyle_body_bio">Jenny Shiner is the communications manager for GPS Insight. She graduated from Arizona State University with a bachelor’s degree in communication and is responsible for communication for all business segments that GPS Insight targets. For more information on telematics and fuel card technologies, visit www.gpsinsight.com.</p>
// <p>Bar</p>
// </div>
// `;
// const result = await rule(body);
// expect(result.html.cleaned).to.equal('<div><p>Foo</p><p>Bar</p></div>');
// });
});