NOTLARIMDAN...: Regular Expression


Thursday, 18 December 2014

Collected notes on RegExp

Tuesday, 2 November 2010

Wrapping URLs in text with an A tag using JavaScript


<script type="text/javascript">
    // Sample text containing two bare URLs.
    var metin = "blah Use http://www.yahoo.com to reach Yahoo and http://google.com to reach Google.";
    // $& is the whole matched URL; $1 is the host name captured by the group.
    var sonuc = metin.replace(/http:\/\/([a-z0-9.-]+)/g, '<a href="$&">$1</a>');
    alert(sonuc);
    document.write(sonuc);
</script>
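The pattern above only covers http:// followed by a host name. As a rough sketch, assuming we also want https and anything up to the next whitespace, the idea could be wrapped in a helper like the one below. The name `linkify` and the URL pattern are illustrative assumptions, not a complete URL grammar.

```javascript
// Hypothetical helper: wraps bare http/https URLs in <a> tags.
// The pattern is deliberately simple and is an assumption, not a full URL grammar.
function linkify(text) {
  // https?   -> "http" optionally followed by "s"
  // [^\s<]+  -> everything up to the next whitespace or "<"
  return text.replace(/https?:\/\/[^\s<]+/g, function (url) {
    return '<a href="' + url + '">' + url + '</a>';
  });
}

var linked = linkify("See http://www.yahoo.com and https://google.com/search?q=regex today");
console.log(linked);
```

A URL that contains a double quote or an ampersand would still need HTML escaping before being written into the page; that step is left out here.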

Monday, 1 November 2010

Regular Expressions in JavaScript

http://www.javascriptkit.com/jsref/regexp.shtml

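// Turkish sample sentence; the pattern below looks for a number followed by " yazi".
var kaynak = "Bir rakamdan sonra 1 yazi aranacak.";

// One or more digits (captured), a space, then the literal word "yazi".
var filitre = /(\d+) yazi/;
var sonuc = kaynak.match(filitre);

console.debug(sonuc);


// ------ OR ----------------


// exec() returns the same result array as match() for a non-global pattern.
var regEx = new RegExp(/(\d+) yazi/);
var digerSonuc = regEx.exec(kaynak);

console.debug(digerSonuc);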
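To make the shape of the result concrete, here is a small sketch of what match() hands back for the same sentence. The variable names `allMatches` and `noMatch` are only illustrative.

```javascript
var kaynak = "Bir rakamdan sonra 1 yazi aranacak.";
var sonuc = kaynak.match(/(\d+) yazi/);

console.log(sonuc[0]);     // the whole match
console.log(sonuc[1]);     // the first capture group (the digits)
console.log(sonuc.index);  // where the match starts in the string

// With the g flag, match() returns only the matched strings, without groups.
var allMatches = "3 yazi, 14 yazi".match(/\d+ yazi/g);
console.log(allMatches);

// When nothing matches, the result is null, so guard before indexing.
var noMatch = "rakam yok".match(/(\d+) yazi/);
console.log(noMatch === null);
```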

http://stackoverflow.com/questions/369147/javascript-regex-to-extract-anchor-text-and-url-from-anchor-tags

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    // arguments[0] is the full match; indexes 1 to 3 are the three capture groups.
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});

This assumes that your anchors will always be in the form <a href="...">...</a> i.e. it won't work if there are any other attributes (for example, target). The regular expression can be improved to accommodate this.
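One possible way to accommodate extra attributes is sketched below. The pattern and the names `anchorPattern` and `found` are assumptions made for illustration; it still expects a double-quoted href and plain text inside the anchor, and a regular expression remains a rough tool compared with a real HTML parser.

```javascript
var html = '<a target="_blank" href="http://yahoo.com" class="x">Yahoo</a> and ' +
           '<a href="http://google.com">Google</a>';

// <a\s       -> the opening of an anchor tag
// [^>]*?     -> any other attributes before href (as few characters as possible)
// href="..." -> the URL, captured as group 1
// [^>]*>     -> any attributes after href, then the end of the opening tag
// ([^<]+)    -> the link text, captured as group 2
var anchorPattern = /<a\s[^>]*?href="([^"]+)"[^>]*>([^<]+)<\/a>/g;

var found = [];
var m;
while ((m = anchorPattern.exec(html)) !== null) {
  found.push([m[1], m[2]]);
}
console.log(found);
```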

To break down the regular expression:

/ -> start regular expression
  [^<]* -> skip all characters until the first <

  ( -> start capturing first token
    <a href=" -> capture first bit of anchor
    ( -> start capturing second token
        [^"]+ -> capture all characters until a "
    ) -> end capturing second token
    "> -> capture more of the anchor
    ( -> start capturing third token
        [^<]+ -> capture all characters until a <

    ) -> end capturing third token
    <\/a> -> capture last bit of anchor
  ) -> end capturing first token
/g -> end regular expression, add global flag to match all anchors in string

Each call to our anonymous function will receive three tokens as the second, third and fourth arguments, namely arguments[1], arguments[2], arguments[3]:

arguments[1] is the entire anchor

arguments[2] is the href part

arguments[3] is the text inside

We'll use a hack to push these three arguments as a new array into our main matches array. The arguments built-in variable is not a true JavaScript Array, so we'll have to apply the slice Array method on it to extract the items we want:

Array.prototype.slice.call(arguments, 1, 4)

This will extract items from arguments starting at index 1 and ending (not inclusive) at index 4.

var input_content = "blah \
    <a href=\"http://yahoo.com\">Yahoo</a> \
    blah \
    <a href=\"http://google.com\">Google</a> \
    blah";

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
  matches.push(Array.prototype.slice.call(arguments, 1, 4));
});
alert(matches.join("\n"));

Gives:

<a href="http://yahoo.com">Yahoo</a>,http://yahoo.com,Yahoo
<a href="http://google.com">Google</a>,http://google.com,Google
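In engines that support String.prototype.matchAll (ES2020 and later), the same extraction can be sketched without using replace() for its side effect. The names `pattern` and `rows` are illustrative; the leading [^<]* from the original pattern is dropped because matchAll simply moves on to the next anchor.

```javascript
var input_content = 'blah <a href="http://yahoo.com">Yahoo</a> blah ' +
                    '<a href="http://google.com">Google</a> blah';

// matchAll requires the g flag and yields one match object per anchor.
var pattern = /<a href="([^"]+)">([^<]+)<\/a>/g;

var rows = Array.from(input_content.matchAll(pattern), function (m) {
  // m[0] is the whole anchor, m[1] the href, m[2] the text inside.
  return [m[0], m[1], m[2]];
});

console.log(rows.join("\n"));
```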

Thursday, 31 December 2009

Regular Expression

First, a reminder:

METACHARACTERS (11)	Non-printable characters
[ \ ^ $ . | ? * + ( )	\t tab (0x09) (9), \r carriage return (0x0D) (13), \n new line (0x0A) (10), \a bell (0x07) (7), \e escape (0x1B) (27), \f form feed (0x0C) (12), \v vertical tab (0x0B) (11)

Let's practice regular expressions on the text below (the sample is kept in Turkish because the patterns that follow match its words):

“Özdemir, biliyor musun, gazetelerde Musevi ya da Yahudi kökenli işadamı Moiz Milka diye yazdıkları zaman tüylerim diken diken oluyor. Ermeni kökenli, Rum kökenli de diyorlar. Sana Türk kökenli Türk yazarı mı diyorlar? Bizler Türkiye Cumhuriyeti vatandaşlarıyız. Senin adının önüne kökenle ilgili bir sıfat yapıştırılmıyorsa, bizlere de yapıştırılamaz. Biz yasalara göre her vatandaş gibi Türk’üz!”

Matches every single character (except a newline):
>> .

Matches only the periods:
>> \.

Matches each "b" together with the two characters before it and the one character after it:
>> ..b.

To find the letter "k":
Writing just
>> k
matches every "k" in the text.

Writing just
>> \bk
matches only the "k" at the start of words that begin with "k".

Writing just
>> \bSana\b
matches the word "Sana".

>> \bsana\b
S != s (think of their byte values), so this is never found.

>> \b..rk\b
Matches a four-letter word (..rk). The first two letters can be anything, but the last two must be "rk", for example "Türk".
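The word-boundary patterns above can be tried in JavaScript with a sketch like the following, using one sentence from the sample text. One caveat worth knowing: in JavaScript \b is defined in terms of the ASCII word characters (a-z, A-Z, 0-9, _), so a letter such as "ö" counts as a non-word character. The Unicode-aware alternative shown at the end relies on lookbehind and \p{L}, which assume an ES2018 or newer engine.

```javascript
var text = "Sana Türk kökenli Türk yazarı mı diyorlar?";

var exact = /\bSana\b/.test(text);         // the word as written
var lower = /\bsana\b/.test(text);         // lowercase "s" does not match "S"
var ignoreCase = /\bsana\b/i.test(text);   // the i flag ignores case
var fourLetter = text.match(/\b..rk\b/g);  // four characters ending in "rk"

console.log(exact, lower, ignoreCase, fourLetter);

// Caveat: "ö" is a non-word character for \b, so the second "k" in
// "kökenli" also sits on a word boundary.
var asciiBoundary = "kökenli".match(/\bk/g).length;

// Unicode-aware variant: a "k" that is not preceded by any letter.
var unicodeAware = "kökenli".match(/(?<!\p{L})k/gu).length;

console.log(asciiBoundary, unicodeAware);
```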

>> [Bb]iz

Matches "Biz" or "biz": a character class [] matches any one of the characters listed inside it.

[a-z] matches lowercase letters.

[a-zA-Z] matches uppercase and lowercase letters.

[0-9] matches digits.

[^öşç] matches any character that is not (^) ö, ş or ç.
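A short sketch of these character classes in JavaScript follows. It also shows a common slip: writing A-z instead of A-Z makes the range cover the punctuation that sits between "Z" and "a" in ASCII ([ \ ] ^ _ and the backtick). The variable names are illustrative only.

```javascript
var bothCases = "Biz biz".match(/[Bb]iz/g);

// [a-z] is an ASCII range, so letters such as "ö" are outside it.
var asciiOnly = /[a-z]/.test("ö");

// The intended class versus the A-z slip.
var intended = /[a-zA-Z]/.test("_");
var slip = /[a-zA-z]/.test("_");

// A negated class: remove everything that is not ö, ş or ç.
var kept = "köşe".replace(/[^öşç]/g, "");

console.log(bothCases, asciiOnly, intended, slip, kept);
```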

Matches the start of the text (a position, so no character is selected):
>> ^

Matches runs of 2 to 6 consecutive "o" characters:
>> o{2,6}

Matches an "x" followed by one or more consecutive "o" characters:
>> xo+

Matches one or more consecutive "o" characters followed by an "x":
>> o+x

Matches an "x" together with any consecutive "o" characters before it; the "o" characters are optional:
>> o*x

o+x : at least one "o" is required.
o?x : the "o" is optional. Without an "o" it matches just the "x"; otherwise it matches the "x" and the single "o" before it.
o*x : the "o" is optional, but if present, all the "o" characters before the "x" are matched along with the "x".
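The difference between +, ? and * is easiest to see side by side. The sketch below runs the three patterns over one made-up sample string; the names `sample`, `plus`, `optional`, `star` and `braces` are illustrative.

```javascript
var sample = "x ox oox ooox";

var plus = sample.match(/o+x/g);      // needs at least one "o"
var optional = sample.match(/o?x/g);  // at most one "o" before the "x"
var star = sample.match(/o*x/g);      // every "o" before the "x", or none

console.log(plus);
console.log(optional);
console.log(star);

// {2,6}: runs of two to six "o" characters; a longer run is cut at six.
var braces = "o oo ooo ooooooo".match(/o{2,6}/g);
console.log(braces);
```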

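Greedy: matches from a "<" to the last "w" on the line:
>> <.+w

Lazy: matches from a "<" to the nearest "w" after it:
>> <.+?w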
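These two patterns differ only in the ? after the +, which switches the quantifier from greedy to lazy. A minimal sketch on a made-up string (the names `html`, `greedy` and `lazy` are illustrative):

```javascript
var html = "<b>new</b> <i>low</i>";

// Greedy: .+ grabs as much as it can, then backs off to the last "w".
var greedy = html.match(/<.+w/)[0];

// Lazy: .+? grows one character at a time and stops at the first "w".
var lazy = html.match(/<.+?w/)[0];

console.log(greedy);
console.log(lazy);
```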

Matches a ">" at the end of the line:
>> >$

Matches whatever single character is at the end of the line:
>> .$

Matches from the first "x" to the end of the line:
>> x.*$

Matches only the literal text:
>> vatan

Matches either the part before the | (pipe) or the part after it:
>> Türk|Yahudi

Matches an "o" that is not followed by another "o", then as many characters as possible, stopping just before a "<" (the "<" itself is not included):
>> o(?!o).*(?=<)
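A small sketch of the last pattern, using a made-up input, shows how the two lookaheads constrain the match without consuming characters. The names `s`, `hit` and `lineEnd` are illustrative.

```javascript
var s = "foo bar<baz";

// The first "o" is followed by another "o", so (?!o) rejects it.
// The second "o" qualifies; .* then runs on and backs off until the
// next character is "<", which (?=<) checks but does not include.
var hit = s.match(/o(?!o).*(?=<)/)[0];
console.log(hit);

// Anchors from the list above: the last character of the line.
var lineEnd = "<p>text</p>".match(/.$/)[0];
console.log(lineEnd);
```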

Special Sequences

\w - Any “word” character (a-z A-Z 0-9 _)

\W - Any non “word” character

\s - Whitespace (space, tab, CR, LF)

\S - Any non whitespace character

\d - Digits (0-9)

\D - Any non digit character

. - (Period) – Any character except newline
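These shorthand classes need their backslash in a pattern (\w, \d, \s and so on). A brief JavaScript sketch on a made-up line of text follows; all variable names are illustrative.

```javascript
var line = "order_42\tshipped 2010-11-01";

var digits = line.match(/\d+/g);   // runs of digits
var words = line.match(/\w+/g);    // runs of word characters (letters, digits, _)
var fields = line.split(/\s+/);    // split on any whitespace, including the tab

// Uppercase letters negate the class: \D is "anything that is not a digit".
var onlyDigits = "a1b2".replace(/\D/g, "");

console.log(digits, words, fields, onlyDigits);
```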

Meta Characters

^ - Start of subject (or line in multiline mode)

$ - End of subject (or line in multiline mode)

[ - Start character class definition

] - End character class definition

| - Alternates, eg (a|b) matches a or b

( - Start subpattern

) - End subpattern

\ - Escape character

Quantifiers

n* - Zero or more of n

n+ - One or more of n

n? - Zero or one occurrences of n

{n} - n occurrences exactly

{n,} - At least n occurrences

{0,m} - At most m occurrences (write the 0 explicitly: JavaScript and older PCRE versions treat {,m} as literal text)

{n,m} - Between n and m occurrences (inclusive)
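The brace quantifiers can be checked quickly in JavaScript with anchored patterns, as sketched below. The last two lines illustrate a JavaScript-specific detail: without the u flag, a brace expression with no lower bound is not a quantifier and is matched as literal text. The variable names are illustrative.

```javascript
var exactly3 = /^a{3}$/.test("aaa");
var atLeast2 = /^a{2,}$/.test("a");
var atMost2 = /^a{0,2}$/.test("aaa");
var between = /^a{2,4}$/.test("aaaa");

console.log(exactly3, atLeast2, atMost2, between);

// Without the u flag, {,2} is not a quantifier in JavaScript:
// the pattern matches the literal characters "a{,2}".
var literalBraces = /^a{,2}$/.test("a{,2}");
var notAQuantifier = /^a{,2}$/.test("aa");

console.log(literalBraces, notAQuantifier);
```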

Pattern Modifiers

i - Case Insensitive

m - Multiline mode - ^ and $ match start and end of lines

s - Dotall - . class includes newline

x - Extended– comments and whitespace

e - preg_replace only – enables evaluation of replacement as PHP code

S - Extra analysis of pattern

U - Pattern is ungreedy

u - Pattern is treated as UTF-8
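The list above describes PCRE modifiers as used from PHP. JavaScript has a smaller set of flags with similar letters; the sketch below covers i, m, s and u, assuming an engine with ES2018 support for the s flag. The variable names are illustrative.

```javascript
var text = "first line\nSecond LINE";

// i: case-insensitive matching.
var caseSensitive = text.match(/line/g).length;
var caseInsensitive = text.match(/line/gi).length;

// m: ^ and $ also match at line breaks.
var withoutM = /^Second/.test(text);
var withM = /^Second/m.test(text);

// s: the dot also matches newline characters.
var withoutS = /first.*Second/.test(text);
var withS = /first.*Second/s.test(text);

// u: the pattern works on code points, so one emoji is one character.
var withoutU = /^.$/.test("\u{1F600}");
var withU = /^.$/u.test("\u{1F600}");

console.log(caseSensitive, caseInsensitive, withoutM, withM, withoutS, withS, withoutU, withU);
```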

Point based assertions

\b - Word boundary

\B - Not a word boundary

\A - Start of subject

\Z - End of subject or newline at end

\z - End of subject

\G - First matching position in subject

Assertions

(?=) - Positive look ahead assertion foo(?=bar) matches foo when followed by bar

(?!) - Negative look ahead assertion foo(?!bar) matches foo when not followed by bar

(?<=) - Positive look behind assertion (?<=foo)bar matches bar when preceded by foo

(?<!) - Negative look behind assertion (?<!foo)bar matches bar when not preceded by foo

(?>) - Once-only subpatterns (?>\d+)bar Performance enhancing when bar not present

(?(x)) - Conditional subpatterns

(?(3)foo|fu)bar - Matches foo if 3rd subpattern has matched, fu if not

(?#) - Comment (?# Pattern does x y or z)
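The four lookaround assertions can be tried in JavaScript as sketched below; lookbehind assumes an ES2018 or newer engine. The once-only (atomic) groups, conditional subpatterns and (?#...) comments in this list are PCRE features that JavaScript regular expressions do not accept. The variable names are illustrative.

```javascript
// Positive lookahead: "foo" only when followed by "bar".
var ahead = "foobar foobaz".match(/foo(?=bar)/g);

// Negative lookahead: "foo" only when not followed by "bar".
var notAhead = "foobar foobaz".replace(/foo(?!bar)/g, "FOO");

// Positive lookbehind: "bar" only when preceded by "foo".
var behind = "foobar bazbar".replace(/(?<=foo)bar/g, "BAR");

// Negative lookbehind: "bar" only when not preceded by "foo".
var notBehind = "foobar bazbar".replace(/(?<!foo)bar/g, "BAR");

console.log(ahead, notAhead, behind, notBehind);
```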

A question about UTF-8 and regex in Java, asked and answered (a good explanation):

Re: Testing for UTF-8 with RegEx?

That regex is designed to match raw byte sequences, not character sequences. When you read the file, you're converting the bytes to characters according to the default encoding, then applying the regex to the character sequence as if it were still the undecoded bytes. For example, assuming the text was really encoded as UTF-8, the character 'ä' would be represented by the two-byte sequence C3 A4. If you decode that using the windows-1252 encoding, it becomes the two-character sequence "Ã¤" ("\u00C3\u00A4"). That matches the regex because the Unicode codepoints of the characters just happen to have the same value as their encodings. But that isn't the case with 'Ä'. Its UTF-8 encoding is C3 84, which becomes "Ã„" ("\u00C3\u201E") after decoding, and of course, that doesn't match the regex.

If you want to confirm that a file contains UTF-8 encoded text, you have to work on the raw bytes. That means you can't use regexes, because Java regexes only work with char sequences. But that's a poor approach anyway, in my opinion. Here's a much better way: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
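The quoted answer is about Java, but the idea of checking the raw bytes rather than decoded characters carries over to other languages. As a rough JavaScript sketch, a strict decoder can be asked to reject invalid byte sequences. The helper name `looksLikeUtf8` is hypothetical, and the code assumes a runtime where TextDecoder supports the fatal option (browsers, or Node.js built with ICU). A clean decode only shows that the bytes are valid UTF-8, not that UTF-8 was the intended encoding.

```javascript
// Hypothetical helper: true when the bytes decode cleanly as UTF-8.
function looksLikeUtf8(bytes) {
  try {
    // fatal: true makes decode() throw on malformed input instead of
    // silently substituting U+FFFD replacement characters.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

var aUmlaut = looksLikeUtf8(new Uint8Array([0xC3, 0xA4]));      // UTF-8 bytes of "ä"
var capitalUmlaut = looksLikeUtf8(new Uint8Array([0xC3, 0x84])); // UTF-8 bytes of "Ä"
var broken = looksLikeUtf8(new Uint8Array([0xC3, 0x28]));        // lead byte without a continuation byte
var singleByte = looksLikeUtf8(new Uint8Array([0xE4]));          // "ä" as a single windows-1252 byte

console.log(aUmlaut, capitalUmlaut, broken, singleByte);
```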

Differences between the .NET and Java regex flavors (from this site):

Feature	.NET	Java
\Q...\E escapes a string of metacharacters	NO	YES
\Q...\E escapes a string of character class metacharacters (in a character set)	NO	YES
(?n) (explicit capture modifier)	YES	NO
?+, *+, ++ and {m,n}+ (possessive quantifiers)	NO	YES
(?<=text) (positive lookbehind)	Full regex	Finite length
(?<!text) (negative lookbehind)	Full regex	Finite length
Conditionals of form (?(?=regex)then|else), (?(regex)then|else), (?(1)then|else) or (?(group)then|else)	YES	NO
(?#comment) comments	YES	NO
Character class is a single token (free-spacing syntax)	YES	NO
\pL through \pC or \p{IsL} through \p{IsC} (Unicode properties)	NO	YES
\p{IsLu} through \p{IsCn} (Unicode property)	NO	YES
\p{InBasicLatin} through \p{InSpecials} or \p{IsBasicLatin} through \p{IsSpecials} (Unicode block)	YES	NO
Spaces, hyphens and underscores allowed in all long names listed above (e.g. BasicLatin can be written as Basic-Latin or Basic_Latin or Basic Latin)	NO	YES (Java 5)
Named captures of style (?<name>regex), (?'name'regex), \k<name> or \k'name'	YES	NO (Java 7 later added (?<name>regex) and \k<name>)
Multiple capturing groups can have the same name	YES	N/A (no named capture groups before Java 7)
XML character class subtraction [abc-[abc]]	YES (2.0)	NO
\p{Alpha} POSIX character class	NO	YES (ASCII)

http://www.orhandogan.us/2008/12/11/duzenli-ifadeler-regular-expression-regex/
http://www.grymoire.com/Unix/Regular.html
http://weitz.de/regex-coach/
http://weitz.de/files/regex-coach.exe
